Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Observing network dynamics through sentinel nodes

Neil G. MacLaren1    Baruch Barzel2,3,4    Naoki Masuda1,5,6,∗ 1Department of Mathematics, State University of New York at Buffalo, NY 14260-2900, USA 2Department of Mathematics, Bar-Ilan University, Ramat-Gan, 5290002, Israel 3The Gonda Multidisciplinary Brain Research Center, Bar-Ilan University, Ramat-Gan, 5290002, Israel 4Network Science Institute, Northeastern University, Boston, MA 02115, USA 5Institute for Artificial Intelligence and Data Science, State University of New York at Buffalo, USA 6Center for Computational Social Science, Kobe University, Kobe, 657-8501, Japan naokimas@buffalo.edu
(July 31, 2024)
Abstract

A fundamental premise of statistical physics is that the particles in a physical system are interchangeable, and hence the state of each specific component is representative of the system as a whole. This assumption breaks down for complex networks, in which nodes may be extremely diverse, and no single component can truly represent the state of the entire system. It seems, therefore, that to observe the dynamics of social, biological or technological networks, one must extract the dynamic states of a large number of nodes—a task that is often practically prohibitive. To overcome this challenge, we use machine learning techniques to detect the network’s sentinel nodes, a set of network components whose combined states can help approximate the average dynamics of the entire network. The method allows us to assess the state of a large complex system by tracking just a small number of carefully selected nodes. The resulting sentinel node set offers a natural probe by which to practically observe complex network dynamics.

I Introduction

The dynamic state of a complex networked system is given in terms of the microscopic states xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of all its components (nodes) [1, 2, 3]. For example, the functionality of a cell can be captured by the expression levels of all individual genes [4, 5, 6]. Similarly, the state of an ecosystem is often characterized by the abundance of all species [7, 8]. Therefore, to observe the state of the system, one must simultaneously track the individual states of its many nodes—a level of empirical control that we seldom possess.

We, therefore, seek efficient reduction methods, that help track the state of the system in a more compact fashion. In most applications, such reduction is achieved by focusing on the average activity of all nodes, i.e. x=i=1Naixidelimited-⟨⟩𝑥superscriptsubscript𝑖1𝑁subscript𝑎𝑖subscript𝑥𝑖\langle x\rangle=\sum_{i=1}^{N}a_{i}x_{i}⟨ italic_x ⟩ = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where the weights {a1,,aN}subscript𝑎1subscript𝑎𝑁\{a_{1},\ldots,a_{N}\}{ italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT } characterize the importance of each node in assessing the system’s dynamic state [9, 10, 11, 12, 13, 14, 15]. For example, selecting ai=kisubscript𝑎𝑖subscript𝑘𝑖a_{i}=k_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, a node’s weighted degree, leads to the Gao-Barzel-Barabási (GBB) reduction [9], which helps predict the critical points of transition between dynamic states. Alternatively, under the dynamics approximate reduction technique (DART), the weight aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the i𝑖iitalic_ith entry of the leading eigenvector of the network’s adjacency matrix [10, 11].

While these reduction methods provide crucial theoretical insight, allowing us to analytically predict the system’s critical points of transition, they offer limited advances on observability. This is because measuring xdelimited-⟨⟩𝑥\langle x\rangle⟨ italic_x ⟩ via GBB or DART, one still needs to access the state of all nodes x1,,xNsubscript𝑥1subscript𝑥𝑁x_{1},\dots,x_{N}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT, thus remaining beyond the practical bounds of empirical accessibility. More selective probes have been introduced in the context of early warning signals, where one follows just a small number of nodes that show signs of looming transitions early on, prior to the rest of the network [16, 17, 18, 19, 20]. The problem is that such early warning nodes are, by design, selected thanks to their divergent behavior around the transition points. This renders them unlikely to be a representative sample of the entire networked system.

How then can we gain the predictive power of GBB or DART, but with a node set that is comparable to the few early warning signifiers? To solve this, here we use machine learning to impose a sparsity condition on {a1,,aN}subscript𝑎1subscript𝑎𝑁\{a_{1},\ldots,a_{N}\}{ italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT }, requiring that ai0subscript𝑎𝑖0a_{i}\neq 0italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ 0 only for a small number of nodes. The result is a set of sentinel nodes, whose average allows us to approximate the state of the entire system. We find that, with only a few strategically selected nodes, we can achieve network-wide observability.

Our analysis shows that the ideal sentinel node sets tend to sample the network heterogeneity, for example, comprising a range of low, average and high degrees. This indicates that the optimal combination of sentinel nodes is mainly determined by the network topology, largely independent of the system’s specific dynamics. Consequently, one can extract the sentinels even without a priori knowledge about the system’s hidden dynamics. This allows practical observability even under extremely restrictive conditions.

II Results

II.1 Seeking the sentinel nodes

Consider coupled nonlinear dynamics on a network, such as mutualistic or epidemic dynamics. (See Methods for the dynamics models and networks used.) In many cases, we want to know the mere average of the network activity, i.e., x¯=i=1Nxi/N¯𝑥superscriptsubscript𝑖1𝑁subscript𝑥𝑖𝑁\overline{x}=\sum_{i=1}^{N}x_{i}/Nover¯ start_ARG italic_x end_ARG = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_N, where xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the activity of the i𝑖iitalic_ith node, and N𝑁Nitalic_N is the number of nodes in the network. For example, in contagion processes, if we assign xi=0subscript𝑥𝑖0x_{i}=0italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 for susceptible and xi=1subscript𝑥𝑖1x_{i}=1italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 for infectious, then the average x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG represents the fraction of infectious individuals in the population. Similarly, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG may represent the average political opinion in a population, the fraction of people aware of news, average biomass in an ecosystem, or the extent of cancerous cells in the body. Quantity x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG is simple, intuitive, and practical, providing a first-hand summary of dynamics on networks.

Denoting by xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT the equilibrium of xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, our goal is to approximate the average activity of all nodes, i.e.,

x¯=1Ni=1Nxi,¯𝑥1𝑁superscriptsubscript𝑖1𝑁superscriptsubscript𝑥𝑖\overline{x}=\frac{1}{N}\sum_{i=1}^{N}x_{i}^{*},over¯ start_ARG italic_x end_ARG = divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , (1)

using a limited set S𝑆Sitalic_S of sentinel nodes, namely,

x¯=1niSxi.superscript¯𝑥1𝑛subscript𝑖𝑆superscriptsubscript𝑥𝑖\overline{x}^{\prime}=\frac{1}{n}\sum_{i\in S}x_{i}^{*}.over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i ∈ italic_S end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT . (2)

Here nNmuch-less-than𝑛𝑁n\ll Nitalic_n ≪ italic_N is the number of sentinel nodes in S𝑆Sitalic_S. To obtain S𝑆Sitalic_S, we simulate the system under a range of conditions, e.g., by varying the average edge weight D𝐷Ditalic_D. Parameter D𝐷Ditalic_D is a natural control parameter; as D𝐷Ditalic_D varies, x𝑥xitalic_x in Eq. (1) may change or undergo critical transitions (Fig. 1). We then seek the optimal node set S𝑆Sitalic_S that minimizes

ε==1L(x¯x¯)2L=1Lx¯,𝜀superscriptsubscript1𝐿superscriptsubscriptsuperscript¯𝑥subscript¯𝑥2𝐿superscriptsubscript1𝐿subscript¯𝑥\varepsilon=\frac{\sum_{\ell=1}^{L}(\overline{x}^{\prime}_{\ell}-\overline{x}_% {\ell})^{2}}{L\sum_{\ell=1}^{L}\overline{x}_{\ell}},italic_ε = divide start_ARG ∑ start_POSTSUBSCRIPT roman_ℓ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ( over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT - over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_L ∑ start_POSTSUBSCRIPT roman_ℓ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT end_ARG , (3)

where x¯subscript¯𝑥\overline{x}_{\ell}over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT and x¯subscriptsuperscript¯𝑥\overline{x}^{\prime}_{\ell}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT are x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG and x¯superscript¯𝑥\overline{x}^{\prime}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, respectively, at the \ellroman_ℓth value of D𝐷Ditalic_D. Equation (3) captures the total difference between the approximation x¯(D)superscript¯𝑥𝐷\overline{x}^{\prime}(D)over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_D ) and the exact x¯(D)¯𝑥𝐷\overline{x}(D)over¯ start_ARG italic_x end_ARG ( italic_D ) across the L𝐿Litalic_L different conditions that together sample a range of D𝐷Ditalic_D values. It is minimal when x¯superscript¯𝑥\overline{x}^{\prime}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT successfully recovers the system average in Eq. (1), including its observed points of transition.

As an example, we consider the coupled double-well dynamics (Fig. 1a) on the dolphin social network (Fig. 1b) under different coupling strength D𝐷Ditalic_D. When uncoupled, each node has two states, i.e., lower and upper states, as equilibria. We initialize each simulation for each D𝐷Ditalic_D value by placing all the nodes in their lower state. We show the equilibrium state values, xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, for all nodes as a function of D𝐷Ditalic_D by the light gray lines in Fig. 1c–h. The black line represents the average activity of the network, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG. As D𝐷Ditalic_D increases, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG increases in stages with a prolonged intermediate range in which some xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPTs remain in their lower states (i.e., xi<2superscriptsubscript𝑥𝑖2x_{i}^{*}<2italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT < 2) and the other xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT have transitioned to their upper states (i.e., xi>5superscriptsubscript𝑥𝑖5x_{i}^{*}>5italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT > 5). This range of D𝐷Ditalic_D values captures a multistage transition [21, 22, 19].

Refer to caption
Figure 1: Approximations of the average activity, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG, of the coupled double-well dynamics on the dolphin network. (a) Coupled double-well dynamics model. (b) Dolphin network with N=62𝑁62N=62italic_N = 62 nodes. Larger circles marked with color represent the nodes used in sentinel node approximations shown in (c)–(f): orange for n=1𝑛1n=1italic_n = 1, yellow for n=2𝑛2n=2italic_n = 2, green for n=3𝑛3n=3italic_n = 3, and blue for n=4𝑛4n=4italic_n = 4. (c)–(h) Equilibrium states in the dolphin network. The gray lines indicate xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT at each value of the control parameter. The black lines indicate x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG. Note that the gray and black lines are the same in panels (c)–(h). The purple lines in (c)–(f) show the sentinel node approximation corresponding to the selected sentinel nodes, whose xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT values are shown in orange (c), yellow (d), green (e), and blue (f). We verified by exhaustive search of every possible combination of nodes that the obtained node sets for n=1𝑛1n=1italic_n = 1 (shown in (c)) and n=2𝑛2n=2italic_n = 2 (shown in (d)) are exact minimizers of ε𝜀\varepsilonitalic_ε. Panels (g) and (h) show the GBB and DART approximations in red and pink, respectively. The brown and dark yellow lines in each plot show the observables (i.e., a particular weighted average of {x1,,xN}superscriptsubscript𝑥1superscriptsubscript𝑥𝑁\{x_{1}^{*},\ldots,x_{N}^{*}\}{ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT }) that the GBB and DART schemes intend to approximate.

We demonstrate approximating x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG with the sentinel node approximation, x¯superscript¯𝑥\overline{x}^{\prime}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, with n=1𝑛1n=1italic_n = 1, 2222, 3333, and 4444 by the purple line in Fig. 1c, 1d, 1e, and 1f, respectively. The sentinel node approximation correctly identifies the onset of the bifurcation at D0.2𝐷0.2D\approx 0.2italic_D ≈ 0.2 for any n{1,,4}𝑛14n\in\{1,\ldots,4\}italic_n ∈ { 1 , … , 4 } but over- or underestimates x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG depending on D𝐷Ditalic_D in the range of the multistage transition. In general, larger optimized node sets do not include smaller ones as a subset. In this example, there is no overlap between any of the optimized node sets with n=1,2,3𝑛123n=1,2,3italic_n = 1 , 2 , 3, and 4444; we show the selected sentinel nodes in color in Fig. 1b. However, the approximation accuracy progressively improves as one includes more sentinel nodes. Concretely, the approximation errors for n=1𝑛1n=1italic_n = 1, 2222, 3333, and 4444 are ε=0.035𝜀0.035\varepsilon=0.035italic_ε = 0.035, ε=0.020𝜀0.020\varepsilon=0.020italic_ε = 0.020, ε=0.016𝜀0.016\varepsilon=0.016italic_ε = 0.016, and ε=0.004𝜀0.004\varepsilon=0.004italic_ε = 0.004, respectively. Figure 1f demonstrates that, with a microscopic number of nodes (i.e., n=4𝑛4n=4italic_n = 4), the sentinel node approximation captures major components of the multistage transition, resulting in a small approximation error. As expected, further increasing n𝑛nitalic_n results in further reduction in approximation error (see SI section S2); however, for all practical purposes n=4𝑛4n=4italic_n = 4 is already sufficient.

For comparison, we show the results obtained from the GBB and DART approximations in Fig. 1g and 1h, respectively. Even with just n=1𝑛1n=1italic_n = 1 sentinel node, our method performs better than the GBB and DART, reducing ε𝜀\varepsilonitalic_ε by more than 75% (GBB: ε=0.171𝜀0.171\varepsilon=0.171italic_ε = 0.171; DART: ε=0.142𝜀0.142\varepsilon=0.142italic_ε = 0.142). This is mainly owing to the fact that, while GBB and DART track the collective state of the system (i.e., xdelimited-⟨⟩𝑥\langle x\rangle⟨ italic_x ⟩ in the Introduction section) and thus predict a single transition point, our method can discern the multistage nature of the actual observed transition (black solid line). Most crucially, thanks to its machine learning component, our framework achieves the observed predictability with a much smaller set of observations—here just with n=4𝑛4n=4italic_n = 4 nodes (Fig. 1f). In sum, our method realizes a dramatic reduction in complexity along with a comparable, and in the present case even improved, accuracy.

Because our algorithm is stochastic, we ran it 100 times, starting with independent initial node sets, on the same network to assess generality of the results shown in Fig. 1. We set n=4𝑛4n=4italic_n = 4. We show the approximation error, ε𝜀\varepsilonitalic_ε, for the 100 optimized node sets by the blue circles in Fig. 2a. Each circle corresponds to one run. The optimized node set is different in each run in general. However, the ε𝜀\varepsilonitalic_ε values of these optimized node sets are similar to each other, and most importantly they are small. This result supports the robustness of our algorithm with respect to the initial condition and stochasticity of the algorithm.

Refer to caption
Figure 2: Approximation error for the coupled double-well dynamics for node sets with n=lnN𝑛𝑁n=\lfloor\ln N\rflooritalic_n = ⌊ roman_ln italic_N ⌋ nodes. A blue circle represents an optimized node set. An orange triangle represents a degree-preserving random node set. A green square represents a completely random node set. (a) Dolphin network. (b) BA network with N=103𝑁superscript103N=10^{3}italic_N = 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT nodes.

Next, we select two groups of random node sets for comparison. First, we take the best optimized node set among the 100 runs in terms of ε𝜀\varepsilonitalic_ε, denoted by S~~𝑆\tilde{S}over~ start_ARG italic_S end_ARG. We then make 100 node sets by randomly selecting n=4𝑛4n=4italic_n = 4 nodes having the same degree sequence as S~~𝑆\tilde{S}over~ start_ARG italic_S end_ARG. For example, if the selected optimized node set is composed of four nodes with degrees 1, 3, 3, and 8, then we sample one node with degree 1 uniformly at random from the network and similarly sample two nodes with degree 3 and one node with degree 8. We call this node set the degree-preserving random node set. We show ε𝜀\varepsilonitalic_ε for each degree-preserving random node set by the orange triangles in Fig. 2a. Second, we make 100 node sets with n=4𝑛4n=4italic_n = 4 by selecting nodes uniformly at random from the entire network. We call this node set the completely random node set. We calculate ε𝜀\varepsilonitalic_ε for each of the 100 completely random node sets, shown by the green squares in Fig. 2a. As expected, the degree-preserving node sets outperform the completely random ones, exhibiting an approximation error that is concentrated on smaller values in general. The crucial point is, however, that our optimized node sets have consistently smaller errors than both alternatives, better than 100% of the random sets, and 89% of the degree-preserving sets. Therefore, while selecting n𝑛nitalic_n nodes with appropriate degrees improves the accuracy of approximation, one can substantially reduce ε𝜀\varepsilonitalic_ε by further optimization beyond only using a good degree sequence.

These results remain similar for other networks; see Fig. 2b for results for a Barabási-Albert (BA) network and SI section S3 for results obtained from eight more networks. In all networks, the optimized node sets yield relatively small ε𝜀\varepsilonitalic_ε.

The patterns described for the coupled double-well dynamics across the various networks also hold true for the mutualistic species, gene-regulatory, and SIS dynamics (see SI section S3). Across every combination of one of the four dynamics and one of the ten networks, the worst optimized node set still has a smaller ε𝜀\varepsilonitalic_ε than 100% of completely random node sets and 95.2% of degree-preserving node sets.

We statistically verified these observations by constructing an analysis of variance (ANOVA) model of the approximation error, ε𝜀\varepsilonitalic_ε. The dependent variable is lnε𝜀\ln\varepsilonroman_ln italic_ε. There are three qualitative independent variables: dynamics (reference: coupled double-well dynamics), network (reference: BA network), and node set type (reference: completely random node sets). We exclude the Erdős-Rényi (ER) network because it yields small ε𝜀\varepsilonitalic_ε regardless of node set type (see SI section S3), presumably because of the small spread of its degree distribution and the relative similarity across all nodes. We show the details of our analysis in SI section S4. Across all dynamics and networks, ε𝜀\varepsilonitalic_ε of the optimized node set is on average more than 200 times smaller than ε𝜀\varepsilonitalic_ε of completely random node sets (p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT) and more than 25 times smaller than ε𝜀\varepsilonitalic_ε of degree-preserving node sets (p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT).

While we have used undirected and unweighted networks, we have verified that these results, including the ANOVA results, remain similar for various weighted networks (see SI section S9) and directed networks (see SI section S10).

II.2 Characteristics of good node sets

We have shown that finding nodes with an appropriate degree sequence is an important but not sufficient condition for finding a good set of sentinel nodes, S𝑆Sitalic_S. In this section, we explore the characteristics of the nodes that our algorithm selects for inclusion in S𝑆Sitalic_S.

Our first observation is that nodes in S𝑆Sitalic_S tend to have a degree sequence that is balanced across the average degree in the entire network. We show this pattern in Fig. 3a for the dolphin network. The top panel of Fig. 3a shows the degree histogram of all nodes in the network. The degree distribution is bimodal, with a predominance of nodes with degree 1 and a secondary group of nodes with degree near the mean degree k¯=5.13¯𝑘5.13\overline{k}=5.13over¯ start_ARG italic_k end_ARG = 5.13. The lower panels of Fig. 3a show the aggregate degree histograms realized by the nodes in S𝑆Sitalic_S obtained from 100 independent runs of the algorithm with n=1𝑛1n=1italic_n = 1, 2222, 3333, and 4444, with one panel corresponding to an n𝑛nitalic_n value. The degrees of the chosen node are sorted such that the blue bars represent the smallest degree node in the optimized node set in the 100 runs, orange the second smallest, red the third smallest, and light blue the largest degree nodes. We find that the algorithm balances the degrees of the selected nodes such that both small-degree and large-degree nodes tend to exist in a node set when n{3,4}𝑛34n\in\{3,4\}italic_n ∈ { 3 , 4 } (and to a lesser extent when n=2𝑛2n=2italic_n = 2). For example, in Fig. 1f, the n=4𝑛4n=4italic_n = 4 nodes selected for S𝑆Sitalic_S have degrees 1, 6, 7, and 8. The xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT values for these four nodes, shown by the four blue lines in Fig. 1f, appear above and below the target values, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG, depending on whether the degree of the node is large and small, respectively. As a result, the average of these four lines provides an accurate approximation to x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG. Qualitatively the same phenomenon also occurs for n=2𝑛2n=2italic_n = 2 and n=3𝑛3n=3italic_n = 3, as shown in Fig. 1d and e, respectively.

Refer to caption
Figure 3: Degree histogram of the original network and that from the nodes in the optimized node set. (a) Dolphin network. (b) BA network. The original degree histograms for the two networks are shown in the top panels of (a) and (b). We show the degree histograms of 100 optimized node sets with different sizes, n𝑛nitalic_n, in the second to the bottom panels in each of (a) and (b). In these panels, the color of the bar represents the order of the node in S𝑆Sitalic_S in terms of the degree (blue: smallest-degree node, orange: second smallest-degree node, red: third node, light blue: fourth node, green: fifth node, yellow: sixth node). The inset in (b) shows the count of nodes with ki>20subscript𝑘𝑖20k_{i}>20italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > 20.

We show in Fig. 3b the results of the same analysis for the BA network used in Fig. 2b. The network has hub nodes, i.e., nodes with large degrees. The top hub nodes, with ki=subscript𝑘𝑖absentk_{i}=italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 38, 40, 66, 76, and 81, are indicated by discrete peaks in the inset of the top panel of Fig. 3b. The bottom panels of Fig. 3b indicate that our optimization algorithm does not select these hub nodes in any of the 100 runs for each n{1,2,3,6}𝑛1236n\in\{1,2,3,6\}italic_n ∈ { 1 , 2 , 3 , 6 }. Instead, the algorithm selects nodes the degree of which is close to the mean; note that the mean degree is close to the minimum degree in BA networks. More specifically, the algorithm selects a mixture of nodes with small and moderately large degrees, but avoiding nodes with larger degrees.

The degree histograms of the optimized node sets shown in Fig. 3 appear to markedly differ from that of the entire network. To quantify the difference, we computed the Kullback-Leibler divergence, DKLsubscript𝐷KLD_{\rm KL}italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT, between the optimized node sets and the original network in terms of the degree distribution (see SI section S5 for the methods).

We show the DKLsubscript𝐷KLD_{\rm KL}italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT values for the dolphin network by the blue circles in Fig. 4a. For comparison, the green squares represent the average DKLsubscript𝐷KLD_{\rm KL}italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT between the original degree distribution and completely random degree distributions constructed with the same number of nodes (i.e., 100n100𝑛100n100 italic_n nodes) as that used for constructing the degree distribution from the 100100100100 optimized node sets. Figure 4a indicates that the discrepancy between the optimized node sets and the original networks in terms of the degree distribution is substantially large when n𝑛nitalic_n is small; note the logarithmic scale on the vertical axis. As n𝑛nitalic_n becomes larger, both the optimized and the random node sets tend to become more similar to the original network in terms of the degree distribution. However, the degree distribution of the optimized node sets is significantly different from the original degree distribution at all n{1,,12}𝑛112n\in\{1,\ldots,12\}italic_n ∈ { 1 , … , 12 }. Therefore, although our method tends to select a mix of large- and small- degree nodes, it does not sample the degree values uniformly at random even for relatively large node sets (i.e., up to n=12𝑛12n=12italic_n = 12). We have verified that the optimized node sets have smaller approximation errors than completely random node sets of equivalent size for the same range of n𝑛nitalic_n (see SI section S2), extending the results for the dolphin network shown in Fig. 2a for n=4𝑛4n=4italic_n = 4. It should be noted that the approximation error for both optimized and completely random node sets decreases as n𝑛nitalic_n increases (see SI section S2), which is expected.

Refer to caption
Figure 4: Comparison between the degree distribution for the optimized node sets and the completely random node sets on two networks. We show the Kullback-Leibler divergence, DKLsubscript𝐷KLD_{\rm KL}italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT, to the degree distribution of the original network. The blue circles represent the DKLsubscript𝐷KLD_{\rm KL}italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT values for the optimized node set. The green squares represent the average DKLsubscript𝐷KLD_{\rm KL}italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT over 100 sets of simulations with 100 completely random node sets each; the 95% confidence interval for each mean is smaller than the square marker and therefore not shown. (a) Dolphin network. (b) BA network.

We have also found that the average degree of the nearest neighbors of the nodes in the optimized sentinel node set, sentinel nodes’ local clustering coefficients, and their community membership systematically deviate from these quantities for uniformly randomly selected nodes. However, heuristically constructed node sets exploiting the observed network properties, such as the avoidance of hub nodes (informed by Fig. 3) or multiple nodes from the same community, only marginally, though significantly, decrease the approximation error, ε𝜀\varepsilonitalic_ε, compared to completely random node sets. See SI section S6 for these results. Therefore, we conclude that running our optimization algorithm is a necessary step to realize a high accuracy in approximating x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG.

II.3 Transfer learning: Effectiveness of optimized node sets across different dynamics

Our optimization algorithm does not use just the information on the network structure but also implicitly requires the dynamic model running on the networks. The challenge is that, in practice, we often do not know the actual dynamics that produce the observed data. Therefore, here we ask whether and to what extent our optimization algorithm enables transfer learning, that is, the application of node sets optimized with one dynamics, called the training dynamics, to another dynamics, called the test dynamics.

We again use the dolphin network as an example. In each panel of Fig. 5a, we plot the approximation error, ε𝜀\varepsilonitalic_ε, of each optimized (blue circles), degree-preserving (orange triangles), and completely random (green squares) node set. For each node set, we show the approximation error for the training dynamics on the horizontal axis and the test dynamics on the vertical axis. We also show the average result (large symbols), as obtained for 100 node sets of each type.

As expected, node sets optimized on the training dynamics have consistently smaller ε𝜀\varepsilonitalic_ε when evaluated on the same dynamics than on an alternative test dynamics. However, although ε𝜀\varepsilonitalic_ε has more variance when evaluated on the test dynamics, the node sets optimized under a different training dynamics still outperform the completely random node sets on average (see Fig. 5a). See Fig. 5b–d for an example. In nine out of the twelve pairs of training and test dynamics, the optimized node sets also outperform the degree-preserving node sets (see Fig. 5a). Therefore, the optimized node sets are transferrable between different dynamics to some extent. This result is striking in particular because different dynamics may have different bifurcation structures. For example, the coupled double-well dynamics have saddle-node bifurcations, and the SIS dynamics have transcritical bifurcations. Additionally, the completely random node sets in each panel show an approximately linear relationship. Therefore, the random node sets that happen by chance to be good sentinels for a specific training dynamics continue to be good sentinels for the test dynamics on average. This result further supports our claim that the performance of node sets is transferrable across different dynamics.

[Uncaptioned image]
Figure 5: Transfer learnability of the sentinel node approximation. (a) Approximation error for optimized node sets evaluated on training and test dynamics. Using the dolphin network, we computed 100 optimized node sets on one (training) dynamics. For the optimized node sets, we show the approximation error when evaluated on the same training dynamics, shown by the blue circles on the horizontal axis, and on an alternative (i.e., test) dynamics, shown on the vertical axis. We also show the approximation errors of 100 degree-preserving node sets (orange triangles) and 100 completely random node sets (green squares) evaluated on the training and test dynamics. The large markers show the average approximation error over the 100 node sets for each type of node set. (b) Dolphin network. The large blue nodes indicate the members of the optimized node set with n=4𝑛4n=4italic_n = 4 nodes that attains the lowest approximation error on the coupled double-well training dynamics. (c) Sentinel node approximation for the coupled double-well dynamics on the dolphin network. The states of the nodes marked in (b) are shown in blue. The states of the remaining nodes are shown in gray. The black line represents the unweighted average state of all nodes, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG. The purple line represents the sentinel node approximation of the node set marked in blue. (d) Sentinel node approximation for the SIS dynamics on the same network when the node set is optimized for the coupled double-well dynamics. The blue lines show xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT for the nodes marked in (b).

As before, we assess the generalizability of these observations with an ANOVA. The dependent variable is lnε𝜀\ln\varepsilonroman_ln italic_ε evaluated on the given test dynamics. The independent variables are the training dynamics, test dynamics, network, and node set type (see SI section S7 for details). Despite the substantial increase in ε𝜀\varepsilonitalic_ε due to the lack of knowledge about the test dynamics, optimized node sets still achieve ε𝜀\varepsilonitalic_ε that is 5.74 times smaller than completely random node sets (p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT) and 1.55 times smaller than degree-preserving node sets (p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT). We obtained similar results on the feasibility of transfer learning in the case of weighted networks (see SI section S9) and directed networks (see SI section S10). In sum, these results suggest that our optimization algorithm can select good sentinel node sets when we do not know the test dynamics, or even the type of bifurcation.

II.4 Weighted averaging

We have restricted ourselves to unweighted averages of xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT as our approximators. Previous work has considered the case n=N𝑛𝑁n=Nitalic_n = italic_N and chosen the weight vector (a1,,aN)subscript𝑎1subscript𝑎𝑁(a_{1},\ldots,a_{N})( italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ) to be either the degree vector (i.e., aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is proportional to the degree of the i𝑖iitalic_ith node) [9, 23, 15] or a particular eigenvector [10, 11, 13, 24, 14]. In this section, we keep n𝑛nitalic_n small and additionally consider optimization of the weights of each node to allow a weighted average of xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT given by x¯=i=1;iSNaixi,superscript¯𝑥superscriptsubscriptformulae-sequence𝑖1𝑖𝑆𝑁subscript𝑎𝑖superscriptsubscript𝑥𝑖\overline{x}^{\prime}=\sum_{i=1;i\in S}^{N}a_{i}x_{i,\ell}^{*}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 ; italic_i ∈ italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, subject to i=1;iSNai=1superscriptsubscriptformulae-sequence𝑖1𝑖𝑆𝑁subscript𝑎𝑖1\sum_{i=1;i\in S}^{N}a_{i}=1∑ start_POSTSUBSCRIPT italic_i = 1 ; italic_i ∈ italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 and ai0subscript𝑎𝑖0a_{i}\geq 0italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≥ 0 iSfor-all𝑖𝑆\forall i\in S∀ italic_i ∈ italic_S. We find these weights using a quadratic programming step (see the Methods section for details) inserted into our original algorithm.

We find that the additional optimization of node weights leads to substantially smaller approximation error, ε𝜀\varepsilonitalic_ε, than the unweighted average examined in the previous sections. Across all dynamics and networks, ε𝜀\varepsilonitalic_ε is on average an additional 46 times smaller when we optimize both node set and node weight than when we only optimize the node set (see SI section S8 for the statistical results).

However, there is no extra advantage to optimizing node weights when one evaluates ε𝜀\varepsilonitalic_ε on alternative dynamics. As in section II.3, we generated, for each dynamics and network, 100 node sets with their node selection and weights optimized on a training dynamics. We then evaluated ε𝜀\varepsilonitalic_ε of the optimized node set on each other test dynamics. We found that, when evaluated on the test dynamics, ε𝜀\varepsilonitalic_ε was not significantly different between the optimized node sets with additional weight optimization and those without (p=0.591𝑝0.591p=0.591italic_p = 0.591). Therefore, although optimizing node weights drastically improves the approximation to x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG if we know the actual dynamics, this advantage is lost if we do not know it.

III Discussion

As our understanding of complex system dynamics advances [3, 1, 25, 26, 27, 28, 29], a crucial bottleneck that we continue to face is our limited empirical access to the states of their multitude of relevant parameters. This bottleneck hinders the development of new theoretical analyses and empirical validation of existing theories on dynamics. A similar challenge in biology and medicine has been that it is often difficult to track every single metabolite, protein, or cell. In that context, the identification of biomarkers [30, 31] can offer dramatic breakthroughs, potentially allowing us to identify the state of the entire system just by tracking a small set of indicative parameters. Our work here envisions an equivalent opportunity in the context of complex networks.

A crucial aspect of our formulation is that the detected sentinels are largely agnostic to the dynamic model. This indicates that one can transfer the learned set of nodes from one dynamics to infer the state of the system under another dynamics. Such transfer learning is especially useful for systems with unknown dynamics, capturing a rather common challenge in the context of complex networked systems [32, 33, 34, 35, 36]. To further capitalize on this potential advantage, future work should aim to extract a master model, i.e., a nonlinear system of equations whose sentinel nodes optimally transfer to other types of dynamics.

Our work differs from other approaches for approximating the population activity of or efficiently observing a network by attempting to accurately estimate x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG. First, there are methods to select a small number of sentinel nodes for constructing early warning signals for anticipating regime shifts [16, 17, 18, 19, 20], but there is no a priori reason to consider that these sentinel nodes are also good at representing x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG. Second, the theory of network observability allows us to determine the node set, observation of which enables one to reconstruct the system’s complete activity [37]. The method has also been applied for selecting sentinel nodes with which to construct early warning signals [18]. However, this observability concerns inference of the initial condition, i.e., the activity of each node at time 00. In fact, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG in the equilibrium or the time course of it, rather than its initial value, would be of practical interest in applications.

Our work also differs from prior work in the importance it attributes to hub nodes in heterogeneous networks. Both the GBB [9] and DART [10, 11] reduce dynamical systems on networks into a similar dynamical system of small dimensions. A key idea behind these methods is to theoretically find an appropriate linear combination x=i=1Naixidelimited-⟨⟩𝑥superscriptsubscript𝑖1𝑁subscript𝑎𝑖superscriptsubscript𝑥𝑖\langle x\rangle=\sum_{i=1}^{N}a_{i}x_{i}^{*}⟨ italic_x ⟩ = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT of x1superscriptsubscript𝑥1x_{1}^{*}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, \ldots, xNsuperscriptsubscript𝑥𝑁x_{N}^{*}italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and the dynamical system that xdelimited-⟨⟩𝑥\langle x\rangle⟨ italic_x ⟩ approximately obeys. To do so, GBB and DART use the degree vector and the dominant eigenvector of the adjacency matrix as 𝒂=(a1,,aN)𝒂subscript𝑎1subscript𝑎𝑁\bm{a}=(a_{1},\ldots,a_{N})bold_italic_a = ( italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT ), respectively, as the first xdelimited-⟨⟩𝑥\langle x\rangle⟨ italic_x ⟩. In fact, many large empirical networks are scale-free networks, i.e., those in which the degree obeys an approximate power-law distribution [38, 39, 40], so that a small fraction of hub nodes have very large degrees. In a related vein, many empirical and model networks show eigenvector localization phenomena, with which the most entries of the dominant eigenvector of the adjacency matrix are close to 00 [41, 42]. Eigenvector localization implies that, at least in large degree-heterogeneous but otherwise uniformly random networks, most of the entries of 𝒂𝒂\bm{a}bold_italic_a are close to 00 and do not contribute to xdelimited-⟨⟩𝑥\langle x\rangle⟨ italic_x ⟩. Therefore, xdelimited-⟨⟩𝑥\langle x\rangle⟨ italic_x ⟩ in the GBB and DART is approximately a weighted sum of a small number of xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. This situation apparently resembles selecting n𝑛nitalic_n (Nmuch-less-thanabsent𝑁\ll N≪ italic_N) nodes for S𝑆Sitalic_S in our framework. However, these methods and our sentinel node approximation select very different nodes. The nodes effectively used by GBB and DART (i.e., those whose aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT values are far from 00) tend to be hub nodes. Therefore, GBB and DART deliberately neglect the behavior of a majority of nodes, which have a small degree. In contrast, our optimized node set tends to avoid hub nodes, which are outliers, and instead mixes nodes with different degrees that are not extremely large. This discrepancy is because our goal is to observe the unweighted average, x¯=i=1Nxi/N¯𝑥superscriptsubscript𝑖1𝑁superscriptsubscript𝑥𝑖𝑁\overline{x}=\sum_{i=1}^{N}x_{i}^{*}/Nover¯ start_ARG italic_x end_ARG = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT / italic_N, while the goal of GBB and DART is to derive a dynamical system approximately closed in terms of xdelimited-⟨⟩𝑥\langle x\rangle⟨ italic_x ⟩. Although hubs play important roles in dynamics on networks such as impacting on the epidemic threshold [43, 44] and having different responsiveness to perturbation compared to low-degree nodes [25], our optimized node sets does not include them to approximate x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG because their behavior is in general different from that of the majority of nodes.

Our framework accommodates directed and weighted networks, noisy dynamics, and, as mentioned above, the case in which dynamical system equations are unknown. Furthermore, without essential changes, our method allows any weighted sum of {x1,,xN}superscriptsubscript𝑥1superscriptsubscript𝑥𝑁\{x_{1}^{*},\ldots,x_{N}^{*}\}{ italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT } as the target quantity to be approximated. Together, these features make our method versatile enough for potential applications. For example, in ecosystem monitoring, our algorithm could help focus monitoring efforts on a small number of key species [45], which may not be the most numerous or well-connected species [46, 47]. Assuming that an estimated network of species interactions is available, which is often the case [48], one can derive a small number of sentinel species (i.e., nodes) by simulating a “ground truth” using a mutualistic species dynamics model on the measured network. Then, the populations of the chosen sentinel species are expected to provide a good approximation of population dynamics of the entire ecosystem.

From a methodological perspective, our drastic reduction from N𝑁Nitalic_N to just a small number of nodes is rooted in a combined toolbox that mixes theoretical modeling with machine learning techniques. The theoretical models, captured within the low-dimensional dynamical systems derived by GBB, DART, or other methods, represent our analytical insights into the system’s internal mechanisms. Such insights, however, are limited by the prohibitive complexity of the system. This is precisely where the machine learning component helps complement the theoretical approach, lending its computational power to help simplify the scale of the problem. This synergy between theoretical insights and the predictive power of modern computational learning technology represents, in our view, one of the main avenues towards predicting, observing, and influencing complex network behavior.

IV Methods

IV.1 Node Selection

Consider an undirected and unweighted network with N𝑁Nitalic_N nodes, M𝑀Mitalic_M edges, and an adjacency matrix A=(Aij)𝐴subscript𝐴𝑖𝑗A=(A_{ij})italic_A = ( italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) with Aij{0,1}subscript𝐴𝑖𝑗01A_{ij}\in\{0,1\}italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∈ { 0 , 1 }, Aii=0subscript𝐴𝑖𝑖0A_{ii}=0italic_A start_POSTSUBSCRIPT italic_i italic_i end_POSTSUBSCRIPT = 0, and Aij=Ajisubscript𝐴𝑖𝑗subscript𝐴𝑗𝑖A_{ij}=A_{ji}italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT italic_j italic_i end_POSTSUBSCRIPT. Each node is characterized by a continuous variable xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT representing, for example, the abundance of a species or the fraction of a population that is infected with a pathogen. A straightforward estimate of the network’s aggregate state is the unweighted average of all node states at equilibrium, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG, given by Eq. (1). Our goal is to approximate x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG by a linear combination of a small number of equilibrium node states, xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, given by Eq. (2), with a small error and across a range of a control parameter that represents, e.g., an environmental state. Examples of the control parameter are the strength of mutualistic interactions between species and the infection rate of a contagion process. Choice of the control parameter, its range, and its effect on xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT varies by model and is further described in section IV.2.

We seek a set S𝑆Sitalic_S of sentinel nodes with |S|=n𝑆𝑛|S|=n| italic_S | = italic_n, nNmuch-less-than𝑛𝑁n\ll Nitalic_n ≪ italic_N. We assume that each node in S𝑆Sitalic_S has equal weight in determining x¯superscript¯𝑥\overline{x}^{\prime}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, as in Eq. (2). We relax this assumption in section II.4. We quantify the discrepancy between x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG and x¯superscript¯𝑥\overline{x}^{\prime}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT by a normalized mean squared error over a given range of the control parameter, i.e., ε==1L(x¯x¯)2L=1Lx¯𝜀superscriptsubscript1𝐿superscriptsubscriptsuperscript¯𝑥subscript¯𝑥2𝐿superscriptsubscript1𝐿subscript¯𝑥\varepsilon=\frac{\sum_{\ell=1}^{L}(\overline{x}^{\prime}_{\ell}-\overline{x}_% {\ell})^{2}}{L\sum_{\ell=1}^{L}\overline{x}_{\ell}}italic_ε = divide start_ARG ∑ start_POSTSUBSCRIPT roman_ℓ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ( over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT - over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_L ∑ start_POSTSUBSCRIPT roman_ℓ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT end_ARG, called the approximation error. Subscript {1,L}1𝐿\ell\in\{1,\ldots L\}roman_ℓ ∈ { 1 , … italic_L } specifies a value of an evenly spaced control parameter, and xi,superscriptsubscript𝑥𝑖x_{i,\ell}^{*}italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT represents the equilibrium value of xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT for the \ellroman_ℓth value of the control parameter; \ellroman_ℓ is omitted in the following text when there should be no confusion. We choose a control parameter which, when varied over the first and last (i.e., L𝐿Litalic_Lth) values, causes bifurcations in xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. We set L=100𝐿100L=100italic_L = 100 in our numerical experiments.

Note that two other known methods for dynamics reduction on networks, GBB [9, 23, 15] and DART [10, 11, 13, 24, 14], attempt to match a different weighted sum of x1subscript𝑥1x_{1}italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, \ldots, xNsubscript𝑥𝑁x_{N}italic_x start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT from x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG. Specifically, the observables of the GBB and DART are in the form i=1Naixisuperscriptsubscript𝑖1𝑁subscript𝑎𝑖subscript𝑥𝑖\sum_{i=1}^{N}a_{i}x_{i}∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where the aisubscript𝑎𝑖a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is proportional to the degree of the i𝑖iitalic_ith node in the case of the GBB and the i𝑖iitalic_ith entry of the leading eigenvector of the adjacency matrix in the case of the DART.

Given all xi,superscriptsubscript𝑥𝑖x_{i,\ell}^{*}italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT values (with i{1,,N}𝑖1𝑁i\in\{1,\ldots,N\}italic_i ∈ { 1 , … , italic_N } and {1,,L}1𝐿\ell\in\{1,\ldots,L\}roman_ℓ ∈ { 1 , … , italic_L }) computed, we determine S𝑆Sitalic_S by combinatorial simulated annealing. Specifically, we first initialize S𝑆Sitalic_S by selecting n𝑛nitalic_n nodes uniformly at random from the network and calculate the approximation error for S𝑆Sitalic_S, which we denote by ε(S)𝜀𝑆\varepsilon(S)italic_ε ( italic_S ). We arbitrarily set n=lnN𝑛𝑁n=\lfloor\ln N\rflooritalic_n = ⌊ roman_ln italic_N ⌋, where \lfloor\cdot\rfloor⌊ ⋅ ⌋ is the floor operation, unless we state otherwise. Then, in each hhitalic_hth iteration of the algorithm, we select an i𝑖iitalic_ith node uniformly at random from S𝑆Sitalic_S and replace it with a j𝑗jitalic_jth node that does not belong to S𝑆Sitalic_S and is chosen uniformly at random. The tentative new node set is given by S=S{i}{j}superscript𝑆𝑆𝑖𝑗S^{\prime}=S\setminus\{i\}\cup\{j\}italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = italic_S ∖ { italic_i } ∪ { italic_j }. We compute ε𝜀\varepsilonitalic_ε for Ssuperscript𝑆S^{\prime}italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT, which we denote by ε(S)𝜀superscript𝑆\varepsilon(S^{\prime})italic_ε ( italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ). We accept Ssuperscript𝑆S^{\prime}italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT as the new S𝑆Sitalic_S according to the Metropolis criterion, i.e., with probability 1111 if ε(S)<ε(S)𝜀superscript𝑆𝜀𝑆\varepsilon(S^{\prime})<\varepsilon(S)italic_ε ( italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) < italic_ε ( italic_S ) and with probability q=exp{[ε(S)ε(S)]/t}𝑞delimited-[]𝜀superscript𝑆𝜀𝑆𝑡q=\exp\{-[\varepsilon(S^{\prime})-\varepsilon(S)]/t\}italic_q = roman_exp { - [ italic_ε ( italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) - italic_ε ( italic_S ) ] / italic_t } otherwise. The normalization factor t=t0/ln(h+e1)𝑡subscript𝑡0𝑒1t=t_{0}/\ln(h+e-1)italic_t = italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT / roman_ln ( italic_h + italic_e - 1 ) decreases as h{1,,hmax}1subscripth\in\{1,\ldots,h_{\max}\}italic_h ∈ { 1 , … , italic_h start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT } increases. Therefore, an Ssuperscript𝑆S^{\prime}italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT with a larger approximation error than S𝑆Sitalic_S is less likely to be accepted as hhitalic_h increases, encouraging convergence to a local minimum of ε𝜀\varepsilonitalic_ε. We set t0=10subscript𝑡010t_{0}=10italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 10. We also set hmax=50Nsubscript50𝑁h_{\max}=50Nitalic_h start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = 50 italic_N and confirmed that this number of iteration was sufficient for reaching a local minimum of ε𝜀\varepsilonitalic_ε.

IV.2 Dynamical system models

We test our method on the following four models of dynamical systems on networks.

Many systems have alternative stable states at a single value of a control parameter, the transition between which displays hysteresis. For example, a certain species may be present or extinct in an ecosystem under a given environmental condition, depending on the system’s state in the past [49]. Recovery from the extinct state may be difficult without substantial improvement of the environment. Similarly, tropical ecosystems may show sudden transitions from rainforest to grassland in response to just small changes in local water availability [50]. A coupled double-well dynamics model on networks [51, 52, 53, 54, 55] has been employed for modeling interacting climate regions [50] and biological species [21] and is given by

dxidt=(xir1)(xir2)(xir3)+Dj=1NAijxj,𝑑subscript𝑥𝑖𝑑𝑡subscript𝑥𝑖subscript𝑟1subscript𝑥𝑖subscript𝑟2subscript𝑥𝑖subscript𝑟3𝐷superscriptsubscript𝑗1𝑁subscript𝐴𝑖𝑗subscript𝑥𝑗\frac{dx_{i}}{dt}=-(x_{i}-r_{1})(x_{i}-r_{2})(x_{i}-r_{3})+D\sum_{j=1}^{N}A_{% ij}x_{j},divide start_ARG italic_d italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG = - ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_r start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ( italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_r start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) + italic_D ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , (4)

where r1subscript𝑟1r_{1}italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, r2subscript𝑟2r_{2}italic_r start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and r3subscript𝑟3r_{3}italic_r start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT determine the location of the equilibria and satisfy r1<r2<r3subscript𝑟1subscript𝑟2subscript𝑟3r_{1}<r_{2}<r_{3}italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_r start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT < italic_r start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT; D𝐷Ditalic_D is the coupling strength parameter. We set r1=1subscript𝑟11r_{1}=1italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 1, r2=3subscript𝑟23r_{2}=3italic_r start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 3, and r3=5subscript𝑟35r_{3}=5italic_r start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = 5. In the absence of coupling, this model has stable equilibria at x=r1𝑥subscript𝑟1x=r_{1}italic_x = italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and x=r3𝑥subscript𝑟3x=r_{3}italic_x = italic_r start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, and an unstable equilibrium at x=r2𝑥subscript𝑟2x=r_{2}italic_x = italic_r start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. In the presence of coupling, all xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTs tend to be near r1subscript𝑟1r_{1}italic_r start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT when D𝐷Ditalic_D is sufficiently small, and they tend to be near x3subscript𝑥3x_{3}italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT when D𝐷Ditalic_D is large. We use D𝐷Ditalic_D as the control parameter in the range [0,1]01[0,1][ 0 , 1 ].

A similar model proposed for mutualistic species interactions is given by

dxidt=B+xi(1xiK)(xiC1)+Dj=1NAijxixjD~+Exi+Hxj,𝑑subscript𝑥𝑖𝑑𝑡𝐵subscript𝑥𝑖1subscript𝑥𝑖𝐾subscript𝑥𝑖𝐶1𝐷superscriptsubscript𝑗1𝑁subscript𝐴𝑖𝑗subscript𝑥𝑖subscript𝑥𝑗~𝐷𝐸subscript𝑥𝑖𝐻subscript𝑥𝑗\frac{dx_{i}}{dt}=B+x_{i}\left(1-\frac{x_{i}}{K}\right)\left(\frac{x_{i}}{C}-1% \right)+D\sum_{j=1}^{N}A_{ij}\frac{x_{i}x_{j}}{\tilde{D}+Ex_{i}+Hx_{j}},divide start_ARG italic_d italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG = italic_B + italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( 1 - divide start_ARG italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_K end_ARG ) ( divide start_ARG italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_C end_ARG - 1 ) + italic_D ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT divide start_ARG italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG over~ start_ARG italic_D end_ARG + italic_E italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_H italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG , (5)

where xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the abundance of the i𝑖iitalic_ith species and B𝐵Bitalic_B, K𝐾Kitalic_K, C𝐶Citalic_C, D𝐷Ditalic_D, D~~𝐷\tilde{D}over~ start_ARG italic_D end_ARG, E𝐸Eitalic_E, and H𝐻Hitalic_H are constants [9]. The constant B𝐵Bitalic_B represents migration rate and K𝐾Kitalic_K the carrying capacity. The Allee constant, C𝐶Citalic_C, represents the ease with which a species can become established in the environment. In the absence of migration and species interactions, xi>0subscript𝑥𝑖0x_{i}>0italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > 0 at equilibrium if and only if the initial xi>Csubscript𝑥𝑖𝐶x_{i}>Citalic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT > italic_C; otherwise xi=0subscript𝑥𝑖0x_{i}=0italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 is the only stable equilibrium. We set B=1𝐵1B=1italic_B = 1, K=5𝐾5K=5italic_K = 5, D~=5~𝐷5\tilde{D}=5over~ start_ARG italic_D end_ARG = 5, E=0.9𝐸0.9E=0.9italic_E = 0.9, and H=0.1𝐻0.1H=0.1italic_H = 0.1 following [9]. We use D𝐷Ditalic_D as the control parameter in the range [0,3]03[0,3][ 0 , 3 ]; to observe the transition of all nodes, we use a more extended range of D𝐷Ditalic_D for this model than for the other models described in this section.

The deterministic susceptible-infectious-susceptible (SIS) model on networks, also called the individual-based approximation of the stochastic SIS model, is given by

dxidt=μxi+λj=1NAij(1xi)xj,𝑑subscript𝑥𝑖𝑑𝑡𝜇subscript𝑥𝑖𝜆superscriptsubscript𝑗1𝑁subscript𝐴𝑖𝑗1subscript𝑥𝑖subscript𝑥𝑗\frac{dx_{i}}{dt}=-\mu x_{i}+\lambda\sum_{j=1}^{N}A_{ij}(1-x_{i})x_{j},divide start_ARG italic_d italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG = - italic_μ italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT + italic_λ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( 1 - italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , (6)

where xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the probability that the i𝑖iitalic_ith node is infectious, λ𝜆\lambdaitalic_λ is the infection rate, and μ𝜇\muitalic_μ is the recovery rate [44]. The second term on the right-hand side expresses the rate at which the j𝑗jitalic_jth node infects the i𝑖iitalic_ith node. We set μ=1𝜇1\mu=1italic_μ = 1 without loss of generality; changing (λ,μ)𝜆𝜇(\lambda,\mu)( italic_λ , italic_μ ) to (cλ,cμ)𝑐𝜆𝑐𝜇(c\lambda,c\mu)( italic_c italic_λ , italic_c italic_μ ) with a constant c𝑐citalic_c is equivalent to scaling the time by a factor of c𝑐citalic_c, so it does not affect the equilibrium. We use λ𝜆\lambdaitalic_λ as the control parameter in the range [0,1]01[0,1][ 0 , 1 ]. One obtains xi=0subscript𝑥𝑖0x_{i}=0italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0 ifor-all𝑖\forall i∀ italic_i when λ𝜆\lambdaitalic_λ is below a value called the epidemic threshold. Above the epidemic threshold, all xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT (<1absent1<1< 1) values are positive given a positive initial value. The parameter λ𝜆\lambdaitalic_λ has the same role as D𝐷Ditalic_D in the other dynamics models.

A model of gene regulatory dynamics is given by

dxidt=Bxif+Dj=1NAijxjh1+xjh,𝑑subscript𝑥𝑖𝑑𝑡𝐵superscriptsubscript𝑥𝑖𝑓𝐷superscriptsubscript𝑗1𝑁subscript𝐴𝑖𝑗superscriptsubscript𝑥𝑗1superscriptsubscript𝑥𝑗\frac{dx_{i}}{dt}=-Bx_{i}^{f}+D\sum_{j=1}^{N}A_{ij}\frac{x_{j}^{h}}{1+x_{j}^{h% }},divide start_ARG italic_d italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG = - italic_B italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_f end_POSTSUPERSCRIPT + italic_D ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT divide start_ARG italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT end_ARG start_ARG 1 + italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT end_ARG , (7)

where xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the expression level of the i𝑖iitalic_ith gene [9]. We set B=1𝐵1B=1italic_B = 1, f=1𝑓1f=1italic_f = 1, and h=22h=2italic_h = 2 following [9]. We use D𝐷Ditalic_D as the control parameter in the range [0,1]01[0,1][ 0 , 1 ]. Given sufficiently large initial values, all xisuperscriptsubscript𝑥𝑖x_{i}^{*}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT will remain large when D𝐷Ditalic_D is above a threshold value, a situation which represents a living cell. When D𝐷Ditalic_D is small, all xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPTs approach zero, representing cell death.

To compute equilibrium values of xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT at each control parameter value, we initially set each xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to a model-dependent standard value (coupled double-well: xi=1subscript𝑥𝑖1x_{i}=1italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1, SIS: xi=0.01subscript𝑥𝑖0.01x_{i}=0.01italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0.01, mutualistic species: xi=0.001subscript𝑥𝑖0.001x_{i}=0.001italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 0.001, gene-regulatory: xi=2subscript𝑥𝑖2x_{i}=2italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 2). Then, we solved the ODEs using the LSODA algorithm [56] provided by the deSolve package for R [57]. We used a total simulation time of T=15𝑇15T=15italic_T = 15, which we found was sufficient to allow each of the above dynamics on each network described below to relax to a point where no further change was noticeable. We used the value of each xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT at T=15𝑇15T=15italic_T = 15 as xi,superscriptsubscript𝑥𝑖x_{i,\ell}^{*}italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT.

IV.3 Networks

We chose five empirical and five model networks to test our method. All networks were coerced to be undirected, simple (i.e., no self- or multi-edges), and unweighted. We used the largest connected component.

In the social network of wild dolphins [58], which we refer to as the dolphin network, each node is a dolphin individual. Two nodes are defined to be adjacent if two individuals were observed together more often than expected by chance. This network has N=62𝑁62N=62italic_N = 62 nodes, M=159𝑀159M=159italic_M = 159 edges, and an average degree k¯=5.13¯𝑘5.13\overline{k}=5.13over¯ start_ARG italic_k end_ARG = 5.13. The coefficient of variation (CV), defined as the standard deviation divided by the mean, of the degree is 0.58.

The BA model generates networks with power-law degree distributions [38]. We initialized a BA network with a complete graph of three nodes and three edges, and added a single node with m=2𝑚2m=2italic_m = 2 edges in each time step. The final network had N=1,000𝑁1000N=1,000italic_N = 1 , 000, M=1,996𝑀1996M=1,996italic_M = 1 , 996, k¯=3.99¯𝑘3.99\overline{k}=3.99over¯ start_ARG italic_k end_ARG = 3.99, and CV of the degree is equal to 1.331.331.331.33.

We used eight other networks as well as the dolphin and BA networks in the ANOVA analyses. See SI section S1 for the description of these eight networks.

IV.4 Optimization of node weights

In section II.4, we approximate x¯subscript¯𝑥\overline{x}_{\ell}over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT with a weighted average of the node states in S𝑆Sitalic_S, x¯=i=1;iSNaixi,subscriptsuperscript¯𝑥superscriptsubscriptformulae-sequence𝑖1𝑖𝑆𝑁subscript𝑎𝑖superscriptsubscript𝑥𝑖\overline{x}^{\prime}_{\ell}=\sum_{i=1;i\in S}^{N}a_{i}x_{i,\ell}^{*}over¯ start_ARG italic_x end_ARG start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 ; italic_i ∈ italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. The GBB and DART assign weights to all the nodes in the network. In contrast, here we determine weights only for the nodes in S𝑆Sitalic_S by a quadratic programming optimization.

We insert the quadratic programming step into our node selection method as follows. Assume that we have a set of nodes S𝑆Sitalic_S, |S|=n𝑆𝑛|S|=n| italic_S | = italic_n. This S𝑆Sitalic_S can be random, as in the initial iteration of the algorithm, or a candidate set Ssuperscript𝑆S^{\prime}italic_S start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT from any subsequent iteration. Let 𝒂={ai}iS𝒂subscriptsubscript𝑎𝑖𝑖𝑆\bm{a}=\{a_{i}\}_{i\in S}bold_italic_a = { italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ italic_S end_POSTSUBSCRIPT be an n𝑛nitalic_n-dimensional vector of weights. We require that these weights are non-negative and sum to 1111. The mean squared error over the L𝐿Litalic_L values of the control parameter is given by

ϵ=1L=1L(i=1;iSNaixi,x¯)2=1L(𝒂XX𝒂2𝒙¯X𝒂+𝒙¯𝒙¯),italic-ϵ1𝐿superscriptsubscript1𝐿superscriptsuperscriptsubscriptformulae-sequence𝑖1𝑖𝑆𝑁subscript𝑎𝑖superscriptsubscript𝑥𝑖subscript¯𝑥21𝐿superscript𝒂topsuperscript𝑋top𝑋𝒂2superscriptbold-¯𝒙top𝑋𝒂superscriptbold-¯𝒙topbold-¯𝒙\epsilon=\frac{1}{L}\sum_{\ell=1}^{L}\left(\sum_{i=1;i\in S}^{N}a_{i}x_{i,\ell% }^{*}-\overline{x}_{\ell}\right)^{2}=\frac{1}{L}\left(\bm{a}^{\top}X^{\top}X% \bm{a}-2\bm{\overline{x}}^{\top}X\bm{a}+\bm{\overline{x}}^{\top}\bm{\overline{% x}}\right),italic_ϵ = divide start_ARG 1 end_ARG start_ARG italic_L end_ARG ∑ start_POSTSUBSCRIPT roman_ℓ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ( ∑ start_POSTSUBSCRIPT italic_i = 1 ; italic_i ∈ italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT - over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_L end_ARG ( bold_italic_a start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X bold_italic_a - 2 overbold_¯ start_ARG bold_italic_x end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X bold_italic_a + overbold_¯ start_ARG bold_italic_x end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT overbold_¯ start_ARG bold_italic_x end_ARG ) , (8)

where X𝑋Xitalic_X is the L×n𝐿𝑛L\times nitalic_L × italic_n matrix of xi,superscriptsubscript𝑥𝑖x_{i,\ell}^{*}italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT (with {1,,L}1𝐿\ell\in\{1,\ldots,L\}roman_ℓ ∈ { 1 , … , italic_L }, iS𝑖𝑆i\in Sitalic_i ∈ italic_S), 𝒙¯=(x¯1,,x¯L)bold-¯𝒙superscriptsubscript¯𝑥1subscript¯𝑥𝐿top\bm{\overline{x}}=(\overline{x}_{1},\ldots,\overline{x}_{L})^{\top}overbold_¯ start_ARG bold_italic_x end_ARG = ( over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , over¯ start_ARG italic_x end_ARG start_POSTSUBSCRIPT italic_L end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, and represents the transposition. Therefore, we determine 𝒂𝒂\bm{a}bold_italic_a by

min\displaystyle\minroman_min 12𝒂XX𝒂𝒙¯X𝒂,12superscript𝒂topsuperscript𝑋top𝑋𝒂superscriptbold-¯𝒙top𝑋𝒂\displaystyle\frac{1}{2}\bm{a}^{\top}X^{\top}X\bm{a}-\bm{\overline{x}}^{\top}X% \bm{a},divide start_ARG 1 end_ARG start_ARG 2 end_ARG bold_italic_a start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X bold_italic_a - overbold_¯ start_ARG bold_italic_x end_ARG start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_X bold_italic_a , (9)
s.t. 𝟏𝒂=1,superscript1top𝒂1\displaystyle\bm{1}^{\top}\bm{a}=1,bold_1 start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_a = 1 ,
𝒂𝟎,𝒂0\displaystyle\bm{a}\geq\bm{0},bold_italic_a ≥ bold_0 ,

where 𝟏=(1,,1)1superscript11top\bm{1}=(1,\ldots,1)^{\top}bold_1 = ( 1 , … , 1 ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT. We use the qpOASES algorithm [59] to solve for 𝒂𝒂\bm{a}bold_italic_a via the ROI package [60] in R. In an iteration of our algorithm, we first tentatively update S𝑆Sitalic_S by carrying out one step of the algorithm for the unweighted averaging, optimize the node weights, 𝒂𝒂\bm{a}bold_italic_a, given the updated S𝑆Sitalic_S by solving Eq. (9), and calculate ε𝜀\varepsilonitalic_ε for the updated S𝑆Sitalic_S and optimized 𝒂𝒂\bm{a}bold_italic_a. We then compare the obtained ε𝜀\varepsilonitalic_ε with the value of ε𝜀\varepsilonitalic_ε before updating S𝑆Sitalic_S and optimizing 𝒂𝒂\bm{a}bold_italic_a, and apply the Metropolis criterion to determine whether or not we adopt the tentatively updated S𝑆Sitalic_S and 𝒂𝒂\bm{a}bold_italic_a. We iterate this process as described in Sec. IV.1.

Note that this additional optimization step substantially increases computation time. However, in practice, local minima are reached after fewer iterations of the simulated annealing algorithm. Therefore, we use 25N25𝑁25N25 italic_N iterations of the algorithm when we optimize the node weights, instead of the 50N50𝑁50N50 italic_N iterations used when we do not optimize the node weights.

Data and code availability

Data and analysis code are available at https://github.com/ngmaclaren/reduction.

Acknowledgements

This work was performed in part at Center for Computational Research, the State University of New York at Buffalo.

Author contributions

B.B. and N. Masuda conceived the study. N.G. MacLaren and N. Masuda developed the method and analyzed the data. N.G. MacLaren wrote the code and ran simulations. All the authors wrote the manuscript.

Funding

B.B. was supported by the Israel Science Foundation (grant no. 499/19), the Israel-China ISF-NSFC joint research program (grant no. 3552/21), and by the VATAT grant for data science research. N. Masuda was supported by the Japan Science and Technology Agency (JST) Moonshot R&D (under grant no. JPMJMS2021), the National Science Foundation (under grant no. 2052720), and JSPS KAKENHI (under grant nos. JP 21H04595, 23H03414, and 24K14840).

References

  • [1] M. A. Porter and J. P. Gleeson. Dynamical systems on networks: A tutorial. Springer, London, UK, 2010.
  • [2] M. E. J. Newman. Networks. Oxford University Press, Oxford, UK, 2nd edition, 2018.
  • [3] A. Barrat, M. Barthélemy, and A. Vespignani. Dynamical processes on complex networks. Cambridge University Press, Cambridge, UK, 2008.
  • [4] G. Karlebach and R. Shamir. Modelling and analysis of gene regulatory networks. Nature Reviews Molecular Cell Biology, 9:770–780, 2008.
  • [5] C. Trapnell. Defining cell types and states with single-cell genomics. Genome Research, 25:1491–1498, 2015.
  • [6] A. Wagner, A. Regev, and N. Yosef. Revealing the vectors of cellular identity with single-cell genomics. Nature Biotechnology, 34(11):1145–1160, 2016.
  • [7] M. S. Wisz, J. Pottier, W. D. Kissling, L. Pellissier, J. Lenoir, C. F. Damgaard, C. F. Dormann, M. C. Forchhammer, J.-A. Grytnes, A. Guisan, R. K. Heikkinen, T. T. Høye, I. Kühn, M. Luoto, L. Maiorano, M.-C. Nilsson, S. Normand, E. Öckinger, N. M. Schmidt, M. Termansen, A. Timmermann, D. A. Wardle, P. Aastrup, and J.-C. Svenning. The role of biotic interactions in shaping distributions and realised assemblages of species: Implications for species distribution modelling. Biological Reviews, 88:15–30, 2013.
  • [8] J. Bascompte and M. Scheffer. The resilience of plant–pollinator networks. Annual Review of Entomology, 68:363–380, 2023.
  • [9] J. Gao, B. Barzel, and A.-L. Barabási. Universal resilience patterns in complex networks. Nature, 530:307–312, 2016.
  • [10] E. Laurence, N. Doyon, L. J. Dubé, and P. Desrosiers. Spectral dimension reduction of complex dynamical networks. Physical Review X, 9:011042, 2019.
  • [11] V. Thibeault, G. St-Onge, L. J. Dubé, and P. Desrosiers. Threefold way to the dimension reduction of dynamics on networks: An application to synchronization. Physical Review Research, 2(4):043215, 2020.
  • [12] C. Tu, P. D’Odorico, and S. Suweis. Dimensionality reduction of complex dynamical systems. iScience, 24:101912, 2021.
  • [13] N. Masuda and P. Kundu. Dimension reduction of dynamical systems on networks with leading and non-leading eigenvectors of adjacency matrices. Physical Review Research, 4(2):023257, 2022.
  • [14] M. Vegué, V. Thibeault, P. Desrosiers, and A. Allard. Dimension reduction of dynamics on modular and heterogeneous directed networks. PNAS Nexus, 2(5):pgad150, 2023.
  • [15] C. Ma, G. Korniss, B. K. Szymanski, and J. Gao. Generalized dimension reduction approach for heterogeneous networked systems with time-delay. arXiv, page 2308.11666, 2023.
  • [16] L. Chen, R. Liu, Z.-P. Liu, M. Li, and K. Aihara. Detecting early-warning signals for sudden deterioration of complex diseases by dynamical network biomarkers. Scientific Reports, 2:342, 2012.
  • [17] F. Vafaee. Using multi-objective optimization to identify dynamical network biomarkers as early-warning signals of complex diseases. Scientific Reports, 6:22023, 2016.
  • [18] A. Aparicio, J. X. Velasco-Hernández, C. H. Moog, Y.-Y. Liu, and M. T. Angulo. Structure-based identification of sensor species for anticipating critical transitions. Proceedings of the National Academy of Sciences of the United States of America, 118(51):e2104732118, 2021.
  • [19] N. G. MacLaren, P. Kundu, and N. Masuda. Early warnings for multi-stage transitions in dynamics on networks. Journal of the Royal Society Interface, 20:20220743, 2023.
  • [20] N. Masuda, K. Aihara, and N. G. MacLaren. Anticipating regime shifts by mixing early warning signals from different nodes. Nature Communications, 15:1086, 2024.
  • [21] J. J. Lever, I. A. van de Leemput, E. Weinans, R. Quax, V. Dakos, E. H. van Nes, J. Bascompte, and M. Scheffer. Foreseeing the future of mutualistic communities beyond collapse. Ecology Letters, 23:2–15, 2020.
  • [22] N. Wunderling, M. Gelbrecht, R. Winkelmann, J. Kurths, and J. F. Donges. Basin stability and limit cycles in a conceptual model for climate tipping cascades. New Journal of Physics, 22:123031, 2020.
  • [23] H. Zhang, Q. Wang, W. Zhang, S. Havlin, and J. Gao. Estimating comparable distances to tipping points across mutualistic systems by scaled recovery rates. Nature Ecology & Evolution, 6:1524–1536, 2022.
  • [24] V. Thibeault, A. Allard, and P. Desrosiers. The low-rank hypothesis of complex systems. Nature Physics, 20:294–302, 2024.
  • [25] B. Barzel and A.-L. Barabási. Universality in network dynamics. Nature Physics, 9:673–681, 2013.
  • [26] Z. Wang, M. A. Andrews, Z.-X. Wu, L. Wang, and C. T. Bauch. Coupled disease–behavior dynamics on complex networks: A review. Physics of Life Reviews, 15:1–29, 2015.
  • [27] U. Harush and B. Barzel. Dynamic patterns of information flow in complex networks. Nature Communications, 8:2181, 2017.
  • [28] C. Hens, U. Harush, S. Haber, R. Cohen, and B. Barzel. Spatiotemporal signal propagation in complex networks. Nature Physics, 15:403–412, 2019.
  • [29] R. M. D’Souza, M. di Bernardo, and Y.-Y. Liu. Controlling complex networks with complex nodes. Nature Reviews Physics, 5:250–262, 2023.
  • [30] J. P. B. O’Connor, E. O. Aboagye, J. E. Adams, H. J. W. L. Aerts, S. F. Barrington, A. J. Beer, R. Boellaard, S. E. Bohndiek, M. Brady, G. Brown, D. L. Buckley, T. L. Chenevert, L. P. Clarke, S. Collette, G. J. Cook, N. M. deSouza, J. C. Dickson, C. Dive, J. L. Evelhoch, C. Faivre-Finn, F. A. Gallagher, F. J. Gilbert, R. J. Gillies, V. Goh, J. R. Griffiths, A. M. Groves, S. Halligan, A. L. Harris, D. J. Hawkes, O. S. Hoekstra, E. P. Huang, B. F. Hutton, E. F. Jackson, G. C. Jayson, A. Jones, D.-M. Koh, D. Lacombe, P. Lambin, N. Lassau, M. O. Leach, T.-Y. Lee, E. L. Leen, J. S. Lewis, Y. Liu, M. F. Lythgoe, P. Manoharan, R. J. Maxwell, K. A. Miles, B. Morgan, S. Morris, T. Ng, A. R. Padhani, G. J. M. Parker, M. Partridge A. P. Pathak, A. C. Peet, S. Punwani, A. R. Reynolds S. P. Robinson, L. K. Shankar, R. A. Sharma, D. Soloviev, S. Stroobants, D. C. Sullivan, S. A. Taylor, P. S. Tofts, G. M. Tozer, M. van Herk, S. Walker-Samuel, J. Wason, K. J. Williams, P. Workman, T. E. Yankeelov, K. M. Brindle, L. M. McShane, A. Jackson, and J. C. Waterton. Imaging biomarker roadmap for cancer studies. Nature Reviews Clinical Oncology, 14:169–186, 2017.
  • [31] R. M. Califf. Biomarker definitions and their applications. Experimental Biology and Medicine, 243:213–221, 2018.
  • [32] B. Barzel, Y.-Y. Liu, and A.-L. Barab’asi. Constructing minimal models for complex system dynamics. Nature Communications, 6:7186, 2015.
  • [33] H.-T. Wai, A. Scaglione, B. Barzel, and A. Leshem. Joint network topology and dynamics recovery from perturbed stationary points. IEEE Transactions on Signal Processing, 67(17):4582–4596, 2019.
  • [34] Y.-C. Lai. Finding nonlinear system equations and complex network structures from data: A sparse optimization approach. Chaos, 31:082101, 2021.
  • [35] T.-T. Gao and G. Yan. Autonomous inference of complex network dynamics from incomplete and noisy data. Nature Computational Science, 2:160–168, 2022.
  • [36] J. Koch, Z. Chen, A. Tur, J. Drgona, and D. Vrabie. Structural inference of networked dynamical systems with universal differential equations. Chaos, 33:023103, 2023.
  • [37] Y.-Y. Liu, J.-J. Slotine, and A.-L. Barabási. Observability of complex systems. Proceedings of the National Academy of Sciences of the United States of America, 110(7):2460–2465, 2013.
  • [38] A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
  • [39] M. E. J. Newman. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics, 46(5):323–351, 2005.
  • [40] I. Voitalov, P. Van Der Hoorn, R. Van Der Hofstad, and D. Krioukov. Scale-free networks well done. Physical Review Research, 1:033034, 2019.
  • [41] A. V. Goltsev, S. N. Dorogovtsev, J. G. Oliveira, and J. F. F. Mendes. Localization and spreading of diseases in complex networks. Physical Review Letters, 109(12):128702, 2012.
  • [42] R. Pastor-Satorras and C. Castellano. Distinct types of eigenvector localization in networks. Scientific Reports, 6:18847, 2016.
  • [43] R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks. Physical Review Letters, 86(14):3200–3203, 2001.
  • [44] R. Pastor-Satorras, C. Castellano, P. Van Mieghem, and A. Vespignani. Epidemic processes in complex networks. Reviews of Modern Physics, 87(3):925–979, 2015.
  • [45] S. L. Hale and J. L. Koprowski. Ecosystem-level effects of keystone species reintroduction: A literature review. Restoration Ecology, 26(3):439–445, 2018.
  • [46] L. S. Mills, M. E. Soulé, and D. F. Doak. The keystone-species concept in ecology and conservation: Management and policy must explicitly consider the complexity of interactions in natural systems. BioScience, 43(4):219–224, 1993.
  • [47] H. E. W. Cottee-Jones and R. J. Whittaker. Perspective: The keystone species concept: A critical appraisal. Frontiers of Biogeography, 4(3), 2012.
  • [48] E. Delmas, M. Besson, M.-H. Brice, L. A. Burkle, G. V. Dalla Riva, M.-J. Fortin, D. Gravel, P. R. Guimar aes Jr., D. H. Hembry, E. A. Newman, J. M. Olesen, M. M. Pires, J. D. Yeakel, and T. Poisot. Analysing ecological networks of species interactions. Biological Reviews, 94:16–36, 2019.
  • [49] M. Scheffer, J. Bascompte, W. A. Brock, V. Brovkin, S. R. Carpenter, V. Dakos, H. Held, E. H. Van Nes, M. Rietkerk, and G. Sugihara. Early-warning signals for critical transitions. Nature, 461:53–59, 2009.
  • [50] N. Wunderling, A. Staal, B. Sakschewski, M. Hirota, O. A. Tuinenburg, J. F. Donges, H. M. J. Barbosa, and R. Winkelmann. Recurrent droughts increase risk of cascading tipping events by outpacing adaptive capacities in the Amazon rainforest. Proceedings of the National Academy of Sciences of the United States of America, 119(32):e2120777119, 2022.
  • [51] N. E. Kouvaris, H. Kori, and A. S. Mikhailov. Traveling and pinned fronts in bistable reaction-diffusion systems on networks. PLOS ONE, 7(9):e45029, 2012.
  • [52] C. D. Brummitt, G. Barnett, and R. M. D’Souza. Coupled catastrophes: Sudden shifts cascade and hop among interdependent systems. Journal of The Royal Society Interface, 12:20150712, 2015.
  • [53] J. Krönke, N. Wunderling, R. Winkelmann, A. Staal, B. Stumpf, O. A. Tuinenburg, and J. F. Donges. Dynamics of tipping cascades on complex networks. Physical Review E, 101(4):042311, 2020.
  • [54] P. Kundu, H. Kori, and N. Masuda. Accuracy of a one-dimensional reduction of dynamical systems on networks. Physical Review E, 105(2):024305, 2022.
  • [55] P. Kundu, N. G. MacLaren, H. Kori, and N. Masuda. Mean-field theory for double-well systems on degree-heterogeneous networks. Proceedings of the Royal Society A, 478:20220350, 2022.
  • [56] Linda Petzold. Automatic selection of methods for solving stiff and nonstiff systems of ordinary differential equations. SIAM Journal on Scientific and Statistical Computing, 4(1):136–148, 1983.
  • [57] K. Soetaert, T. Petzoldt, and R. W. Setzer. Solving differential equations in R: Package deSolve. Journal of Statistical Software, 33(9):1–25, 2010.
  • [58] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson. The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology, 54:396–405, 2003.
  • [59] H. J. Ferreau, C. Kirches, A. Potschka, H. G. Bock, and M. Diehl. qpOASES: A parametric active-set algorithm for quadratic programming. Mathematical Programming Computation, 6:327–363, 2014.
  • [60] S. Theußl, F. Schwendinger, and K. Hornik. ROI: An extensible R optimization infrastructure. Journal of Statistical Software, 94(15):1–64, 2020.
  • [61] J. Kunegis. KONECT – The Koblenz Network Collection. In Proceedings of the 22nd International Conference on the World Wide Web, pages 1343–1350, 2013.
  • [62] L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.-F. Pinton, and W. Van den Broeck. What’s in a crowd? Analysis of face-to-face behavioral networks. Journal of Theoretical Biology, 271:166–180, 2011.
  • [63] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási. The large-scale organization of metabolic networks. Nature, 407:651–654, 2000.
  • [64] L. Šubelj and M. Bajec. Robust network community detection using balanced propagation. European Physical Journal B, 81:353–362, 2011.
  • [65] R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas. Self-similar community structure in a network of human interactions. Physical Review E, 68(6):065103, 2003.
  • [66] A. A. Hagberg, D. A. Schult, and P. J. Swart. Exploring network structure, dynamics, and function using NetworkX. In G. Varoquaux, T. Vaught, and J. Millman, editors, Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA USA, 2008.
  • [67] G. Csárdi, T. Nepusz, V. Traag, Sz. Horvát, F. Zanini, D. Noom, and K. Müller. igraph: Network Analysis and Visualization in R, 2024. R package version 2.0.3.
  • [68] P. Holme and B. J. Kim. Growing scale-free networks with tunable clustering. Physical Review E, 65(2):026107, 2002.
  • [69] K.-I. Goh, B. Kahng, and D. Kim. Universal behavior of load distribution in scale-free networks. Physical Review Letters, 87(27):278701, 2001.
  • [70] F. Chung and L. Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6:125–145, 2002.
  • [71] Y. S. Cho, J. S. Kim, J. Park, B. Kahng, and D. Kim. Percolation transitions in scale-free networks under the Achlioptas process. Physical Review Letters, 103(13):135702, 2009.
  • [72] A. Lancichinetti, S. Fortunato, and F. Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.
  • [73] A. Arenas, A. Díaz-Guilera, and C. J. Pérez-Vicente. Synchronization reveals topological scales in complex networks. Physical Review Letters, 96(11):114102, 2006.
  • [74] M. T. Schaub, J.-C. Delvenne, R. Lambiotte, and M. Barahona. Structured networks and coarse-grained descriptions: A dynamical perspective. In P. Doreian, V. Batagelj, and A. Ferligoj, editors, Advances in Network Clustering and Blockmodeling, pages 333–361. John Wiley & Sons, Ltd, 2020.
  • [75] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics, 2008:P10008, 2008.
  • [76] Y. Takahata. Diachronic changes in the dominance relations of adult female Japanese monkeys of the Arashiyama B group. In L. M. Fedigan and P. J. Asquith, editors, The Monkeys of Arashiyama: Thirty-five Years of Research in Japan and the West, pages 123–139. State University of New York Press, Albany, NY, 1991.
  • [77] J. S. Coleman. Introduction to Mathematical Sociology. Collier-Macmillan, London, 1964.
  • [78] L. C. Freeman, C. M. Webster, and D. M. Kirke. Exploring social structure using dynamic three-dimensional color images. Social Networks, 20(2):109–118, 1998.
  • [79] T. P. Peixoto. The Netzschleuder network catalogue and repository, 2020. https://networks.skewed.de/, accessed 18 April 2024.
  • [80] L. C. Freeman, S. C. Freeman, and A. G. Michaelson. On human social intelligence. Journal of Social and Biological Structures, 11(4):415–425, 1988.
  • [81] B. Hayes. Connecting the dots. American Scientist, 94(5):400–404, 2006.
  • [82] R. B. Correia, L. P. de Araújo Kohler, M. M. Mattos, and L. M. Rocha. City-wide electronic health records reveal gender and age biases in administration of known drug–drug interactions. NPJ Digital Medicine, 2:74, 2019.
  • [83] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
  • [84] S. J. Cook, T. A. Jarrell, C. A. Brittin, Y. Wang, A. E. Bloniarz, M. A. Yakovlev, K. C. Q. Nguyen, L. T.-H. Tang, E. A. Bayer, J. S. Duerr, H. E. Bülow, O. Hobert, D. H. Hall, and S. W. Emmons. Whole-animal connectomes of both Caenorhabditis elegans sexes. Nature, 571:63–71, 2019.
  • [85] R. Hausmann, C. A. Hidalgo, S. Bustos, M. Coscia, A. Simoes, and M A. Yildirim. The Atlas of Economic Complexity: Mapping Paths to Prosperity. MIT Press, 2013.
  • [86] R. M. Thompson and C. R. Townsend. Impacts on stream food webs of native and exotic forest: An intercontinental comparison. Ecology, 84(1):145–161, 2003.
  • [87] J. Coleman, E. Katz, and H. Menzel. The diffusion of an innovation among physicians. Sociometry, 20(4):253–270, 1957.
  • [88] R. Michalski, S. Palus, and P. Kazienko. Matching organizational structure and social network extracted from email communication. In W. Abramowicz, editor, 14th International Conference on Business Information Systems, Lecture Notes in Business Information Processing, pages 197–206, 2011.
  • [89] L. Šubelj and M. Bajec. Community structure of complex software systems: Analysis and applications. Physica A, 390(16):2968–2975, 2011.
  • [90] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics, 31:64–68, 2002.
  • [91] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
  • [92] Bureau of Transportation Statistics. T-100 domestic market, 2020. https://networks.skewed.de/net/us_air_traffic, accessed 18 April 2024.
  • [93] United States Federal Aviation Administration. Air traffic control system command center, 2010. https://networks.skewed.de/net/faa_routes, accessed 18 April 2024.

Supplementary Materials for:
Approximating nonlinear dynamics on networks by sentinel nodes

Neil G. MacLaren, Baruch Barzel, and Naoki Masuda

Appendix S1 Undirected and unweighted networks used in the analysis

In the main text, we have shown results from two undirected and unweighted networks, i.e., the dolphin network and a network generated by the Barabási-Albert (BA) model. We used eight other undirected and unweighted networks, of which four are empirical networks and the other four are model networks, in our statistical analyses.

We downloaded the five empirical networks, including the dolphin network described in the main text, from the KONECT repository [61] at http://konect.cc. The other four empirical networks are as follows:

Proximity

A network of visitors at a museum [62]. Each visitor is a node. Two nodes are adjacent (i.e., directly connected by an edge) if any face-to-face contact of 20 seconds or more was recorded between them. There are 69 days of recording [62]. We use the day with the largest number of contacts, which is the network provided in the KONECT repository. The network has N=410𝑁410N=410italic_N = 410, M=2,765𝑀2765M=2,765italic_M = 2 , 765, k¯=13.49¯𝑘13.49\overline{k}=13.49over¯ start_ARG italic_k end_ARG = 13.49, and CV of the degree equal to 0.620.620.620.62.

Metabolic

A metabolic network of the nematode Caenorhabditis elegans [63]. In this network, nodes are metabolic compounds. Two nodes are adjacent if one metabolite is a product of the other. Note that the original study treated this network as directed, but we use the undirected version here. This network has N=453𝑁453N=453italic_N = 453, M=2,025𝑀2025M=2,025italic_M = 2 , 025, k¯=8.94¯𝑘8.94\overline{k}=8.94over¯ start_ARG italic_k end_ARG = 8.94, and CV of the degree equal to 1.871.871.871.87.

Road

A network of major European roads [64]. Each node represents a city. Two cities are adjacent if there is a road connection between them. Our version of this network has N=1,039𝑁1039N=1,039italic_N = 1 , 039, M=1,305𝑀1305M=1,305italic_M = 1 , 305, k¯=2.51¯𝑘2.51\overline{k}=2.51over¯ start_ARG italic_k end_ARG = 2.51, and CV of the degree equal to 0.480.480.480.48.

Email

A network of email exchanges at the University of Rovira i Virgili [65]. Nodes in this network are email accounts. Two nodes are adjacent if at least one email was sent between them. Our version of this network has N=1,133𝑁1133N=1,133italic_N = 1 , 133, M=5,451𝑀5451M=5,451italic_M = 5 , 451, k¯=9.62¯𝑘9.62\overline{k}=9.62over¯ start_ARG italic_k end_ARG = 9.62, and CV of the degree equal to 0.970.970.970.97.

We used instances of five undirected random network models, including the BA model, with N=1,000𝑁1000N=1,000italic_N = 1 , 000 nodes. We removed any self-loops or multi-edges and retained the largest connected component. We generated the Erdős-Rényi (ER), BA, Holme-Kim (HK), and Lancichinetti-Fortunato-Radicchi (LFR) models using NetworkX [66], and the Gao-Kahng-Kim (GKK) model using igraph [67]. The BA network is described in the main text. The remaining four networks are as follows:

ER

We generated an undirected ER network with the probability of connecting two edges set to p=0.05𝑝0.05p=0.05italic_p = 0.05. The resulting network was connected and had N=1,000𝑁1000N=1,000italic_N = 1 , 000, M=25,132𝑀25132M=25,132italic_M = 25 , 132, k¯=50.26¯𝑘50.26\overline{k}=50.26over¯ start_ARG italic_k end_ARG = 50.26, and CV of the degree equal to 0.140.140.140.14.

HK

The HK model is a variation of the BA model that aims to produce high clustering (i.e., large density of triangles) [68]. We set m=2𝑚2m=2italic_m = 2 and a target local clustering coefficient of 0.10.10.10.1. The final network had N=1,000𝑁1000N=1,000italic_N = 1 , 000, M=1,996𝑀1996M=1,996italic_M = 1 , 996, k¯=3.99¯𝑘3.99\overline{k}=3.99over¯ start_ARG italic_k end_ARG = 3.99, CV of the degree equal to 1.401.401.401.40, and an average local clustering coefficient of 0.120.120.120.12.

GKK

The GKK model is a node fitness model [69]. A node is assigned a fitness fi=(i+i01)αsubscript𝑓𝑖superscript𝑖subscript𝑖01𝛼f_{i}=(i+i_{0}-1)^{-\alpha}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ( italic_i + italic_i start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - 1 ) start_POSTSUPERSCRIPT - italic_α end_POSTSUPERSCRIPT, where i0=N11α[102(1α)]1αsubscript𝑖0superscript𝑁11𝛼superscriptdelimited-[]1021𝛼1𝛼i_{0}=N^{1-\frac{1}{\alpha}}\left[10\sqrt{2}(1-\alpha)\right]^{\frac{1}{\alpha}}italic_i start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_N start_POSTSUPERSCRIPT 1 - divide start_ARG 1 end_ARG start_ARG italic_α end_ARG end_POSTSUPERSCRIPT [ 10 square-root start_ARG 2 end_ARG ( 1 - italic_α ) ] start_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG italic_α end_ARG end_POSTSUPERSCRIPT constrains the maximum degree [70, 71]. For each i,j{1,,N}𝑖𝑗1𝑁i,j\in\{1,\ldots,N\}italic_i , italic_j ∈ { 1 , … , italic_N }, edge (i,j)𝑖𝑗(i,j)( italic_i , italic_j ) is present with probability fifj(=1Nf)2subscript𝑓𝑖subscript𝑓𝑗superscriptsuperscriptsubscript1𝑁subscript𝑓2\frac{f_{i}f_{j}}{(\sum_{\ell=1}^{N}f_{\ell})^{2}}divide start_ARG italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG ( ∑ start_POSTSUBSCRIPT roman_ℓ = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT roman_ℓ end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG. We set α=1.25𝛼1.25\alpha=1.25italic_α = 1.25, N=1,000𝑁1000N=1,000italic_N = 1 , 000, and M=2,500𝑀2500M=2,500italic_M = 2 , 500. The largest connected component of the generated network had N=949𝑁949N=949italic_N = 949, M=2,496𝑀2496M=2,496italic_M = 2 , 496, k¯=5.26¯𝑘5.26\overline{k}=5.26over¯ start_ARG italic_k end_ARG = 5.26, and CV of the degree equal to 0.840.840.840.84.

LFR

The LFR benchmark model produces networks that have both heterogeneous degree distributions and community structure with heterogeneous community sizes [72]. We set the expected power-law exponents to 33-3- 3 and 1.51.5-1.5- 1.5 for the degree distribution and the distribution of community sizes, respectively. We set the probability of connecting nodes between communities to 0.1, the expected average degree to 4, and the minimum community size to 20. The final network had N=998𝑁998N=998italic_N = 998, M=1,988𝑀1988M=1,988italic_M = 1 , 988, k¯=3.98¯𝑘3.98\overline{k}=3.98over¯ start_ARG italic_k end_ARG = 3.98, and CV of the degree equal to 0.660.660.660.66.

Appendix S2 Approximation error as a function of n𝑛nitalic_n

In the main text (section II.2), we showed that the degree distributions of optimized node sets remain distinct from those of completely random node sets up to n=12𝑛12n=12italic_n = 12 for the dolphin and BA networks. Figure S6 indicates that the approximation errors of optimized node sets also remain distinct from and lower than those of completely random node sets. We note that the approximation error of both optimized and completely random node sets decreases as n𝑛nitalic_n increases, which is expected.

Refer to caption
Figure S6: Approximation error as a function of node set size for two networks. The blue circles show the average approximation error over the 100 optimized node sets. The green squares show the average approximation error over 100 completely random node sets. The confidence interval for each average is smaller than the corresponding marker and is therefore not shown. (a) Dolphin network. (b) BA network.

Appendix S3 Distribution of approximation errors for various networks

In Figs. 2(a) and (b) in the main text, we demonstrated that the state averaged over the nodes in the optimized node set can closely approximate the state averaged over all the nodes in the coupled double-well dynamics on the dolphin and BA networks, respectively. We show the corresponding results for all ten networks in Fig. S7, as well as for the mutualistic interaction, SIS, and gene-regulatory dynamics in Figs. S8, S9, and S10, respectively. Figures S7(a) and (g) are identical to Figs. 2(a) and (b), respectively.

Across all dynamics and networks, the optimized node sets have relatively small approximation error. However, the relationship between degree-preserving and completely random node sets differs. In some networks, the difference between the degree-preserving and completely random node sets is large, such as for the dolphin network (Fig. S7a). However, it is not the case for the road network (Fig. S7d) and the LFR network (Fig. S7j). This latter result may be because the degree sequence may not contain sufficient information to characterize these networks: the road network has a complicated community structure imposed by geography [64], and the LFR network has a marked community structure by design [72]. Either of these cases may make particular nodes more suitable or less so for inclusion in S𝑆Sitalic_S for reasons other than the node’s degree. In addition, the road network has a narrow degree distribution, probably making the degree a less important indicator of a node’s importance in representing the dynamics of the entire network. Even in these cases, our algorithm finds node sets with relatively low approximation error.

Refer to caption
Figure S7: Approximation error for the coupled double-well dynamics for node sets with n=lnN𝑛𝑁n=\lfloor\ln N\rflooritalic_n = ⌊ roman_ln italic_N ⌋ nodes. We compare among the optimized node sets (shown by the blue circles), degree-preserving random node sets (orange triangles), and completely random node sets (green squares). (a) Dolphin. (b) Metabolic. (c) Proximity. (d) Road. (e) Email. (f) ER. (g) BA. (h) HK. (i) GKK. (j) LFR. Panels (a) and (g) are also shown in Fig. 2 in the main text.
Refer to caption
Figure S8: Approximation error for the mutualistic interaction dynamics. See the caption of Fig. S7 for the legends.
Refer to caption
Figure S9: Approximation error for the SIS dynamics. See the caption of Fig. S7 for the legends.
Refer to caption
Figure S10: Approximation error for the gene-regulatory dynamics. See the caption of Fig. S7 for the legends.

Appendix S4 Statistical results for the effect of node set type on approximation error

To systematically assess the performance of our optimization algorithm across different dynamics and networks, we generated 100 optimized node sets by running the optimization algorithm 100 times, for each combination of dynamics and network. For comparison, we also generated 100 degree-preserving node sets, one per optimized node set, 100 completely random node sets for each network, and 100 node sets from each heuristic algorithm introduced in section S6 (k𝑘kitalic_k-constrained, k𝑘kitalic_k-quantiled, knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained, and community-based) for each network. Note that neither the completely random nor heuristic algorithm node sets depend on the dynamics.

We computed the approximation error for each node set obtained by each of the seven algorithms, each of the four dynamics, and each of the ten networks. To compare the approximation error, we conducted a multi-way analysis of variance (ANOVA) with three independent variables, i.e., dynamics (reference: coupled double-well dynamics), network (reference: BA network), and node set type (reference: completely random node sets). We exclude the ER network because all node sets achieve small approximation errors on this network, probably due to limited heterogeneity in the degree distribution. The dependent, or response, variable in the model is the approximation error, ε𝜀\varepsilonitalic_ε. We use lnε𝜀\ln\varepsilonroman_ln italic_ε because tiny (i.e., near zero) error values are present; the error values obey skewed distributions for most dynamics, networks, and node set types; and the variance of ε𝜀\varepsilonitalic_ε tends to be large when the mean of ε𝜀\varepsilonitalic_ε is large.

Overall, our ANOVA model predicts lnε𝜀\ln\varepsilonroman_ln italic_ε well (R2=0.70superscript𝑅20.70R^{2}=0.70italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.70). In part due to the large sample size (i.e., 25,200 observations), all three independent variables are highly significant (dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=7017.89𝐹7017.89F=7017.89italic_F = 7017.89, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT, where the F𝐹Fitalic_F statistic is the mean-square error of the factor divided by the mean-square error of the residuals; network: df=8𝑑𝑓8df=8italic_d italic_f = 8, F=175.08𝐹175.08F=175.08italic_F = 175.08, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=6𝑑𝑓6df=6italic_d italic_f = 6, F=6091.11𝐹6091.11F=6091.11italic_F = 6091.11, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT).

We show the differences between the average error for each pair of node set types, as computed by a Tukey’s honestly significant difference test, in Table S1. Each row of Table S1 specifies the estimated difference in average lnε𝜀\ln\varepsilonroman_ln italic_ε between the two node set types, the 95% confidence interval of the estimated mean difference, and a p𝑝pitalic_p-value for the difference adjusted for multiple comparisons. For example, the first row states that, across dynamics and networks, optimized node sets have an average lnε𝜀\ln\varepsilonroman_ln italic_ε value that is 5.625 smaller than that of completely random node sets. On the natural scale, the average ε𝜀\varepsilonitalic_ε for optimized node sets is 1/e5.625=277.31superscript𝑒5.625277.31/e^{-5.625}=277.31 / italic_e start_POSTSUPERSCRIPT - 5.625 end_POSTSUPERSCRIPT = 277.3 times smaller than the average ε𝜀\varepsilonitalic_ε for completely random node sets.

Table S1: Differences in average lnε𝜀\ln\varepsilonroman_ln italic_ε between different node set types when we use the test dynamics for optimization. The differences are computed by a Tukey’s honestly significant difference test. The 95% confidence intervals (CI) of the differences and associated p𝑝pitalic_p-values, adjusted for multiple comparisons, are also shown.
Difference CI p𝑝pitalic_p
Optimized -- Random 5.625-5.625-5.625- 5.625 [5.737,5.512]5.7375.512[-5.737,-5.512][ - 5.737 , - 5.512 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Random 2.345-2.345-2.345- 2.345 [2.457,2.233]2.4572.233[-2.457,-2.233][ - 2.457 , - 2.233 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Random 0.170-0.170-0.170- 0.170 [0.282,0.058]0.2820.058-[0.282,-0.058]- [ 0.282 , - 0.058 ] 1.62×1041.62superscript1041.62\times 10^{-4}1.62 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Random 0.138-0.138-0.138- 0.138 [0.250,0.025]0.2500.025[-0.250,-0.025][ - 0.250 , - 0.025 ] 5.59×1035.59superscript1035.59\times 10^{-3}5.59 × 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Random 0.262-0.262-0.262- 0.262 [0.375,0.150]0.3750.150[-0.375,-0.150][ - 0.375 , - 0.150 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Random 0.157-0.157-0.157- 0.157 [0.269,0.045]0.2690.045[-0.269,-0.045][ - 0.269 , - 0.045 ] 7.40×1047.40superscript1047.40\times 10^{-4}7.40 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
Degree-preserving -- Optimized 3.2803.2803.2803.280 [3.168,3.392]3.1683.392[3.168,3.392][ 3.168 , 3.392 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Optimized 5.4555.4555.4555.455 [5.343,5.567]5.3435.567[5.343,5.567][ 5.343 , 5.567 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Optimized 5.4875.4875.4875.487 [5.375,5.599]5.3755.599[5.375,5.599][ 5.375 , 5.599 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Optimized 5.3625.3625.3625.362 [5.250,5.474]5.2505.474[5.250,5.474][ 5.250 , 5.474 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Optimized 5.4685.4685.4685.468 [5.356,5.580]5.3565.580[5.356,5.580][ 5.356 , 5.580 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Degree-preserving 2.1752.1752.1752.175 [2.063,2.287]2.0632.287[2.063,2.287][ 2.063 , 2.287 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Degree-preserving 2.2072.2072.2072.207 [2.095,2.320]2.0952.320[2.095,2.320][ 2.095 , 2.320 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Degree-preserving 2.0822.0822.0822.082 [1.970,2.195]1.9702.195[1.970,2.195][ 1.970 , 2.195 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Degree-preserving 2.1882.1882.1882.188 [2.076,2.300]2.0762.300[2.076,2.300][ 2.076 , 2.300 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- k𝑘kitalic_k-constrained 0.0320.0320.0320.032 [0.080,0.145]0.0800.145[-0.080,0.145][ - 0.080 , 0.145 ] 0.9790.9790.9790.979
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-constrained 0.092-0.092-0.092- 0.092 [0.205,0.020]0.2050.020[-0.205,0.020][ - 0.205 , 0.020 ] 0.1860.1860.1860.186
Community-based -- k𝑘kitalic_k-constrained 0.0130.0130.0130.013 [0.099,0.125]0.0990.125[-0.099,0.125][ - 0.099 , 0.125 ] 0.9990.9990.9990.999
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-quantiled 0.125-0.125-0.125- 0.125 [0.237,0.013]0.2370.013[-0.237,-0.013][ - 0.237 , - 0.013 ] 0.0180.0180.0180.018
Community-based -- k𝑘kitalic_k-quantiled 0.019-0.019-0.019- 0.019 [0.132,0.093]0.1320.093[-0.132,0.093][ - 0.132 , 0.093 ] 0.9990.9990.9990.999
Community-based -- knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained 0.1060.1060.1060.106 [0.007,0.218]0.0070.218[-0.007,0.218][ - 0.007 , 0.218 ] 0.0810.0810.0810.081

The average lnε𝜀\ln\varepsilonroman_ln italic_ε of any node set type is smaller than that of completely random node sets (1st–6th rows of Table S1). In fact, optimized node sets have the smallest error (1st and 7th–11th rows), followed by the degree-preserving node sets (2nd, 7th, and 12th–15th rows), followed by the node sets generated by the heuristic algorithms (remaining rows). The node sets generated by the heuristic algorithms are in general poorly differentiated within them (last six rows of Table S1).

Appendix S5 Kullback-Leibler divergence

We measure how different a sampled degree distribution is from the original degree distribution of the network with the Kullback-Leibler divergence, DKLsubscript𝐷KLD_{\rm KL}italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT. The Kullback-Leibler divergence is defined as

DKL(QP)=k𝒦Q(k)log[Q(k)P(k)],subscript𝐷KLconditional𝑄𝑃subscript𝑘𝒦𝑄𝑘𝑄𝑘𝑃𝑘D_{\rm KL}(Q\|P)=\sum_{k\in\mathcal{K}}Q(k)\log\left[\frac{Q(k)}{P(k)}\right],italic_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_Q ∥ italic_P ) = ∑ start_POSTSUBSCRIPT italic_k ∈ caligraphic_K end_POSTSUBSCRIPT italic_Q ( italic_k ) roman_log [ divide start_ARG italic_Q ( italic_k ) end_ARG start_ARG italic_P ( italic_k ) end_ARG ] , (S10)

where P(k)𝑃𝑘P(k)italic_P ( italic_k ) is the degree distribution of the original network (i.e., the proportion of nodes having degree k𝑘kitalic_k), Q(k)𝑄𝑘Q(k)italic_Q ( italic_k ) is the degree distribution based on the set of sampled nodes, and 𝒦𝒦\mathcal{K}caligraphic_K is the set of unique degree values in the original network. To compute Q(k)𝑄𝑘Q(k)italic_Q ( italic_k ), we first generated 100 optimized node sets with a given n𝑛nitalic_n. Then, we calculated the proportion of nodes, across the 100 node sets, which had each degree in 𝒦𝒦\mathcal{K}caligraphic_K.

Appendix S6 Effects of structural features on the inclusion of nodes in optimized node sets

In the main text, we analyzed the degree distributions of optimized sentinel node sets in comparison with the degree distribution of completely randomly node sets. Other structural features of the network may also play roles in making nodes suitable for inclusion in optimized node sets. In this section, we investigate three such features.

S6.1 Effects of three structural features on the approximation error

First, we examined the effect of the average nearest neighbor degrees of the nodes in the optimized node sets. The average nearest neighbor degree of an i𝑖iitalic_ith node is given by

knn,i=1kij=1;j𝒩iNkjsubscript𝑘nn𝑖1subscript𝑘𝑖superscriptsubscriptformulae-sequence𝑗1𝑗subscript𝒩𝑖𝑁subscript𝑘𝑗k_{{\rm nn},i}=\frac{1}{k_{i}}\sum_{j=1;j\in\mathcal{N}_{i}}^{N}k_{j}italic_k start_POSTSUBSCRIPT roman_nn , italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_j = 1 ; italic_j ∈ caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT (S11)

where 𝒩isubscript𝒩𝑖\mathcal{N}_{i}caligraphic_N start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the set of the kisubscript𝑘𝑖k_{i}italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT neighbors of the i𝑖iitalic_ith node. We compute the average of knn,isubscript𝑘nn𝑖k_{{\rm nn},i}italic_k start_POSTSUBSCRIPT roman_nn , italic_i end_POSTSUBSCRIPT over the n𝑛nitalic_n nodes in the node set S𝑆Sitalic_S by

knn=1ni=1;iSNknn,isubscript𝑘nn1𝑛superscriptsubscriptformulae-sequence𝑖1𝑖𝑆𝑁subscript𝑘nn𝑖k_{\rm nn}=\frac{1}{n}\sum_{i=1;i\in S}^{N}k_{{\rm nn},i}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 ; italic_i ∈ italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_k start_POSTSUBSCRIPT roman_nn , italic_i end_POSTSUBSCRIPT (S12)

for each optimized and completely random node set. We show in Fig. S11 the distribution of knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT in the optimized and completely random node sets for the coupled double-well dynamics on the ten networks used in Fig. S7. The figure suggests that nodes in optimized node sets tend to have larger, or at least not smaller, knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT than completely random node sets on average. Qualitatively, this observation suggests that our algorithm prefers nodes that “see” more of the network, i.e., nodes with neighbors receiving input from a larger portion of the network.

Refer to caption
Figure S11: Distribution of the average nearest neighbor degree of the nodes in sentinel and completely random node sets. The blue and green bars represent optimized and completely random node sets, respectively. We generated 100 node sets of each type to generate each distribution shown. Each panel represents a network. (a) Dolphin. (b) Proximity. (c) Metabolic. (d) Road. (e) Email. (f) ER. (g) BA. (h) HK. (i) GKK. (j) LFR.

Second, we examined the effect of the local clustering coefficient. The local clustering coefficient of an i𝑖iitalic_ith node is given by [2]

Ci=number of i’s neighbors that are adjacent to each other12ki(ki1).subscript𝐶𝑖number of 𝑖’s neighbors that are adjacent to each other12subscript𝑘𝑖subscript𝑘𝑖1C_{i}=\frac{\text{number of }i\text{'s neighbors that are adjacent to each % other}}{\frac{1}{2}k_{i}(k_{i}-1)}.italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG number of italic_i ’s neighbors that are adjacent to each other end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG 2 end_ARG italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - 1 ) end_ARG . (S13)

We computed the average of Cisubscript𝐶𝑖C_{i}italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT over the n𝑛nitalic_n nodes in node set S𝑆Sitalic_S by

Clocal=1ni=1;iSNCisubscript𝐶local1𝑛superscriptsubscriptformulae-sequence𝑖1𝑖𝑆𝑁subscript𝐶𝑖C_{\text{local}}=\frac{1}{n}\sum_{i=1;i\in S}^{N}C_{i}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 ; italic_i ∈ italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT (S14)

for each optimized and completely random node set. We show the distribution of Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT in the optimized and completely random node sets for the coupled double-well dynamics on various networks in Fig. S12. The figure indicates that Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT does not much differ between optimized and completely random node sets.

Refer to caption
Figure S12: Distribution of the average local clustering coefficient in sentinel and completely random node sets. In each panel, we show a binned distribution of Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT for a network. (a) Dolphin. (b) Proximity. (c) Metabolic. (d) Road. (e) Email. (f) ER. (g) BA. (h) HK. (i) GKK. (j) LFR.

Third, we hypothesized that nodes in optimized node sets are more likely to come from different communities than those in completely random node sets. This hypothesis is intuitive because nodes in the same community are expected to show relatively similar dynamics [73, 10, 74, 14], and therefore it is probably redundant to devote multiple sentinel nodes to a single community when one can only use a small number of sentinel nodes. To test the hypothesis, we confined ourselves to six networks, i.e., the five empirical networks and the LFR network, which have relatively strong community structure. We obtained a partition of each network into communities using the Louvain algorithm [75]. Then, for each node set, we counted the number of nodes in the same community and recorded the maximum such number over all the communities, which we denote by K𝐾Kitalic_K. In Fig. S13, we show the distribution of K𝐾Kitalic_K for the optimized and completely random node sets. We find that the optimized node sets seem to have a somewhat smaller number of sentinel nodes coming from the same community (thus, smaller K𝐾Kitalic_K) than the randomized node sets, supporting our claim, albeit weakly in appearance.

Refer to caption
Figure S13: The maximum number of nodes in each node set coming from the same community. (a) Dolphin. (b) Proximity. (c) Metabolic. (d) Road. (e) Email. (f) LFR.

To formalize these observations, we attempted to classify each node set as either optimized or completely random based only on structural features of the node set. A successful classification would imply that one can likely construct a high-performance node set by relying on these node quantities and avoid running our optimization algorithm. We used logistic regression for this classification task. Specifically, we use a generalized linear model with binomial errors and a logit link function in which the dependent variable is the binary outcome of a trial (i.e., either a node set is completely random, which is the reference, or it is optimized). The independent variables are the simulation conditions (i.e., dynamics and network); k𝑘kitalic_k, the average degree of the n𝑛nitalic_n nodes in the node set; knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT, the average nearest neighbor degree (i.e., knn,isubscript𝑘nn𝑖k_{{\rm nn},i}italic_k start_POSTSUBSCRIPT roman_nn , italic_i end_POSTSUBSCRIPT) averaged over the n𝑛nitalic_n nodes (see Eqs. (S11) and (S12)); Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT, the average local clustering coefficient of the n𝑛nitalic_n nodes (see Eqs. (S13) and (S14)); and K𝐾Kitalic_K, the maximum number of nodes in a node set assigned to the same community. We analyzed the same number of node sets of each type: 100 optimized node sets for each combination of dynamics and network, and an equal number of completely random node sets (i.e., 400 random node sets for each network). We again exclude the ER network because of its relatively homogeneous degree distribution.

We show the results of this analysis in Table S2. We find that the classification is unsuccessful; our model explains less than 1% of the variance in classification of a node set as random or optimized (McFadden’s pseudo-R2=0.008superscript𝑅20.008R^{2}=0.008italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.008). However, the coefficients on each node quantity are statistically significant and consistent with our analysis above: optimized node sets are associated with smaller k𝑘kitalic_k (model coefficient b=0.055𝑏0.055b=-0.055italic_b = - 0.055, p=3.62×106𝑝3.62superscript106p=3.62\times 10^{-6}italic_p = 3.62 × 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT), larger knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT (b=0.023𝑏0.023b=0.023italic_b = 0.023, p=1.19×104𝑝1.19superscript104p=1.19\times 10^{-4}italic_p = 1.19 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT), smaller Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT (b=1.10𝑏1.10b=-1.10italic_b = - 1.10, p=1.73×104𝑝1.73superscript104p=1.73\times 10^{-4}italic_p = 1.73 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT), and smaller K𝐾Kitalic_K (b=0.228𝑏0.228b=-0.228italic_b = - 0.228, p<×107p<\times 10^{-7}italic_p < × 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT). Note that the coefficient on Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT is the least extreme among the four (|z|=3.76𝑧3.76|z|=3.76| italic_z | = 3.76), consistent with our analysis above, while the p𝑝pitalic_p-value is small due to a large sample size. Additionally, the controls for dynamics in this logistic regression are not significant, suggesting that the composition of optimized node sets may be more strongly related to network structure than dynamics. In sum, our model supports that, across dynamics and networks, optimized node sets avoid large-degree nodes, include nodes with better connected neighbors which do not in turn connect with each other, and avoid nodes that come from the same community. However, all these effects are weak.

Table S2: Regression results in search for features of the optimized node set. The dependent variable is whether the node set is completely random (the reference) or optimized. The model is a logistic regression, i.e., a generalized linear model with binomial errors and a logit link function. SE: standard error, z𝑧zitalic_z: coefficient estimate divided by the standard error.
Estimate SE z𝑧zitalic_z p𝑝pitalic_p
Intercept 0.8650.8650.8650.865 0.1590.1590.1590.159 5.445.445.445.44 5.21×1085.21superscript1085.21\times 10^{-8}5.21 × 10 start_POSTSUPERSCRIPT - 8 end_POSTSUPERSCRIPT
Dynamics
       Mutualistic species 0.0050.0050.0050.005 0.0670.0670.0670.067 0.070.070.070.07 0.9430.9430.9430.943
       SIS 0.018-0.018-0.018- 0.018 0.0670.0670.0670.067 0.27-0.27-0.27- 0.27 0.7870.7870.7870.787
       Gene-regulatory 0.0040.0040.0040.004 0.0670.0670.0670.067 0.060.060.060.06 0.9500.9500.9500.950
Network
       Proximity 0.5160.5160.5160.516 0.1620.1620.1620.162 3.183.183.183.18 0.0010.0010.0010.001
       Metabolic 0.391-0.391-0.391- 0.391 0.2940.2940.2940.294 1.33-1.33-1.33- 1.33 0.1830.1830.1830.183
       Road 0.429-0.429-0.429- 0.429 0.1370.1370.1370.137 3.13-3.13-3.13- 3.13 0.0020.0020.0020.002
       Email 0.0170.0170.0170.017 0.1330.1330.1330.133 0.130.130.130.13 0.8970.8970.8970.897
       BA 0.529-0.529-0.529- 0.529 0.1410.1410.1410.141 3.75-3.75-3.75- 3.75 1.76×1041.76superscript1041.76\times 10^{-4}1.76 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
       HK 0.450-0.450-0.450- 0.450 0.1270.1270.1270.127 3.54-3.54-3.54- 3.54 4.07×1044.07superscript1044.07\times 10^{-4}4.07 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
       GKK 0.363-0.363-0.363- 0.363 0.1340.1340.1340.134 2.72-2.72-2.72- 2.72 0.0070.0070.0070.007
       LFR 0.294-0.294-0.294- 0.294 0.1110.1110.1110.111 2.65-2.65-2.65- 2.65 0.0080.0080.0080.008
k𝑘kitalic_k 0.055-0.055-0.055- 0.055 0.0120.0120.0120.012 4.63-4.63-4.63- 4.63 3.62×1063.62superscript1063.62\times 10^{-6}3.62 × 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT 0.0230.0230.0230.023 0.0060.0060.0060.006 3.853.853.853.85 1.19×1041.19superscript1041.19\times 10^{-4}1.19 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT 1.100-1.100-1.100- 1.100 0.2930.2930.2930.293 3.76-3.76-3.76- 3.76 1.73×1041.73superscript1041.73\times 10^{-4}1.73 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
K𝐾Kitalic_K 0.228-0.228-0.228- 0.228 0.0400.0400.0400.040 5.73-5.73-5.73- 5.73 <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Null deviance 9981 df𝑑𝑓dfitalic_d italic_f: 7199
Residual deviance 9903 df𝑑𝑓dfitalic_d italic_f: 7184
Pseudo-R2superscript𝑅2R^{2}italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT 0.008

S6.2 Heuristic algorithms for node set selection

To explore the possibility of using the information obtained from these analyses to eliminate the need for our optimization algorithm, we designed four heuristic algorithms for node set selection based on our observations. These algorithms do not depend on the dynamics and only use the information on the network structure to different extents. We avoided algorithms using Clocalsubscript𝐶localC_{\text{local}}italic_C start_POSTSUBSCRIPT local end_POSTSUBSCRIPT because, as we showed with Fig. S12, its distribution was apparently indifferent between the optimized and completely random node sets for at least some networks.

For the first algorithm, which we call “k𝑘kitalic_k-constrained,” we reject the largest 5% of nodes in terms of degree and select n𝑛nitalic_n nodes uniformly at random without replacement from the remaining 95% of nodes. We also reject the 5% largest-degree nodes before selecting n𝑛nitalic_n nodes without replacement in each of the following algorithms.

In the “k𝑘kitalic_k-quantiled” algorithm, we select one node uniformly at random from each of n𝑛nitalic_n divisons of the degree distribution. For example, if n=4𝑛4n=4italic_n = 4, then we select one node from the smallest 25% of nodes in terms of degree, one node from the 25th–50th percent of nodes, and so on, among the 95% of nodes with the smallest degrees.

In the “knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained” algorithm, we reject the bottom 5% of nodes in terms of knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT and select nodes uniformly at random and without replacement from the remaining nodes.

Finally, the “community-based” algorithm runs as follows. First, we partition the network into communities using the Louvain algorithm [75]. Then, we select nodes softly avoiding multiple nodes from the same community. Specifically, we select the i𝑖iitalic_ith community with probability ci/j=1nccjsubscript𝑐𝑖superscriptsubscript𝑗1subscript𝑛csubscript𝑐𝑗\sqrt{c_{i}}/\sum_{j=1}^{n_{\text{c}}}\sqrt{c_{j}}square-root start_ARG italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG / ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT square-root start_ARG italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG, where cisubscript𝑐𝑖c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the number of nodes in the i𝑖iitalic_ith community and ncsubscript𝑛cn_{\text{c}}italic_n start_POSTSUBSCRIPT c end_POSTSUBSCRIPT is the number of communities in the network, then select one node uniformly at random without replacement from the i𝑖iitalic_ith community. We repeat this procedure n𝑛nitalic_n times. Note that the unbiased sampling of nodes would select each community with probability ci/j=1nccjsubscript𝑐𝑖superscriptsubscript𝑗1subscript𝑛csubscript𝑐𝑗c_{i}/\sum_{j=1}^{n_{\text{c}}}c_{j}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Therefore, nodes in large communities are underrepresented in our community-based node set such that selection of multiple nodes from the same community is discouraged.

We found that selecting node sets according to these algorithms yielded smaller approximation error than with completely random node sets. Specifically, the k𝑘kitalic_k-constrained, k𝑘kitalic_k-quantiled, knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained, and community-based node sets have 1.19, 1.15, 1.30, and 1.17 times smaller approximation error, respectively, than the completely random node sets (see Table S1). This result verifies that the structural properties that our optimization algorithm emphasizes, i.e., degree, average nearest-neighbor degree, and community membership, characterize good sentinel nodes. However, their contribution to suppressing the approximation error is modest. In contrast, one can achieve much smaller approximation error by actually running our optimization algorithm, suggesting that our optimization algorithm determines efficient sentinel node sets by exploiting more complex information on the given network than commonly known properties such as the node degrees and community structure. These observations also hold true when we do not know the test dynamics (Table S3) and for weighted (Tables S6 and S7) and directed (Tables S8 and S9) networks.

Appendix S7 Evaluating approximation error on an alternative dynamics

We have assumed that we can use the equations of the actual dynamical system in order to optimize node set selection. This condition is unlikely to hold in practice. In this section, we assess the consequences of optimizing with a proposed, or training, dynamics that is not the actual, or test, dynamics. Our procedure works as follows. For a given network, we pretend that we do not know the test dynamics (e.g., SIS dynamics) and therefore run a training dynamics (e.g., coupled double-well dynamics). Then, we analyze the approximation error obtained by the node sets we optimized with the training dynamics when the actual dynamics are the test dynamics (see Figure 5a in the main text).

We followed this procedure for each combination of training dynamics, test dynamics, network, and node set type. Specifically, for each network, we used the 100 node sets optimized on the training dynamics (e.g., coupled double-well dynamics) to calculate approximation errors on each test dynamics (i.e., SIS dynamics, gene-regulatory dynamics, and mutualistic species dynamics if the training dynamics is the coupled double-well dynamics). We carry out this procedure for each pair of different training and test dynamics, each network, and each node set type. Note that only the optimized and degree-preserving node sets use information on the dynamics. Therefore, there are 100 optimized and degree-preserving node sets for each pair of training dynamics and network, but 100 node sets for each network for each of the other five types of node set. As in section S4, we use lnε𝜀\ln\varepsilonroman_ln italic_ε as the dependent variable and built a generalized linear model with Gaussian errors and an identity link function. We excluded data from the ER network for the same reason as that stated in section S4.

We analyzed the obtained lnε𝜀\ln\varepsilonroman_ln italic_ε values by a multi-way ANOVA as in section S4. Even without knowledge of the test dynamics, our model explains a substantial portion of the variance in lnε𝜀\ln\varepsilonroman_ln italic_ε (R2=0.58superscript𝑅20.58R^{2}=0.58italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.58), although it is less than when we do know the test dynamics (see section S4). Each of the four independent variables is highly significant (training dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=2980.2𝐹2980.2F=2980.2italic_F = 2980.2, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; test dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=24920.9𝐹24920.9F=24920.9italic_F = 24920.9, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; network: df=8𝑑𝑓8df=8italic_d italic_f = 8, F=643.8𝐹643.8F=643.8italic_F = 643.8, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=6𝑑𝑓6df=6italic_d italic_f = 6, F=2643.2𝐹2643.2F=2643.2italic_F = 2643.2, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT). Note that the large number (i.e., 75,6007560075,60075 , 600) of observations in part explains the small p𝑝pitalic_p-values.

Table S3 shows the differences in average lnε𝜀\ln\varepsilonroman_ln italic_ε, estimated by Tukey’s honestly significant difference test. Despite not knowing the test dynamics, optimized node sets are still associated with the smallest ε𝜀\varepsilonitalic_ε, followed by the degree-preserving, heuristic algorithms, and completely random node sets. Although the difference between several pairs of the four heuristic algorithms in terms of lnε𝜀\ln\varepsilonroman_ln italic_ε is statistically significant, the magnitude of the differences among them is small.

Table S3: Differences in average lnε𝜀\ln\varepsilonroman_ln italic_ε between different node set types when we do not know the test dynamics. The differences are computed by a Tukey’s honestly significant difference test. The 95% confidence intervals (CI) of the differences and associated p𝑝pitalic_p-values, adjusted for multiple comparisons, are also shown.
Difference CI p𝑝pitalic_p
Optimized -- Random 1.748-1.748-1.748- 1.748 [1.804,1.692]1.8041.692[-1.804,-1.692][ - 1.804 , - 1.692 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Random 1.310-1.310-1.310- 1.310 [1.366,1.254]1.3661.254[-1.366,-1.254][ - 1.366 , - 1.254 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Random 0.170-0.170-0.170- 0.170 [0.226,0.114]0.2260.114[-0.226,-0.114][ - 0.226 , - 0.114 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Random 0.138-0.138-0.138- 0.138 [0.194,0.082]0.1940.082[-0.194,-0.082][ - 0.194 , - 0.082 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Random 0.262-0.262-0.262- 0.262 [0.319,0.206]0.3190.206[-0.319,-0.206][ - 0.319 , - 0.206 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Random 0.157-0.157-0.157- 0.157 [0.213,0.101]0.2130.101[-0.213,-0.101][ - 0.213 , - 0.101 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Optimized 0.4380.4380.4380.438 [0.381,0.494]0.3810.494[0.381,0.494][ 0.381 , 0.494 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Optimized 1.5781.5781.5781.578 [1.522,1.634]1.5221.634[1.522,1.634][ 1.522 , 1.634 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Optimized 1.6101.6101.6101.610 [1.554,1.666]1.5541.666[1.554,1.666][ 1.554 , 1.666 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Optimized 1.4851.4851.4851.485 [1.429,1.541]1.4291.541[1.429,1.541][ 1.429 , 1.541 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Optimized 1.5911.5911.5911.591 [1.535,1.647]1.5351.647[1.535,1.647][ 1.535 , 1.647 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Degree-preserving 1.1401.1401.1401.140 [1.084,1.196]1.0841.196[1.084,1.196][ 1.084 , 1.196 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Degree-preserving 1.1731.1731.1731.173 [1.117,1.229]1.1171.229[1.117,1.229][ 1.117 , 1.229 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Degree-preserving 1.0481.0481.0481.048 [0.992,1.104]0.9921.104[0.992,1.104][ 0.992 , 1.104 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Degree-preserving 1.1531.1531.1531.153 [1.097,1.209]1.0971.209[1.097,1.209][ 1.097 , 1.209 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- k𝑘kitalic_k-constrained 0.0320.0320.0320.032 [0.024,0.088]0.0240.088[-0.024,0.088][ - 0.024 , 0.088 ] 0.6130.6130.6130.613
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-constrained 0.092-0.092-0.092- 0.092 [0.149,0.036]0.1490.036[-0.149,-0.036][ - 0.149 , - 0.036 ] 2.37×1052.37superscript1052.37\times 10^{-5}2.37 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT
Community-based -- k𝑘kitalic_k-constrained 0.0130.0130.0130.013 [0.043,0.069]0.0430.069[-0.043,0.069][ - 0.043 , 0.069 ] 0.9930.9930.9930.993
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-quantiled 0.125-0.125-0.125- 0.125 [0.181,0.069]0.1810.069[-0.181,-0.069][ - 0.181 , - 0.069 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- k𝑘kitalic_k-quantiled 0.019-0.019-0.019- 0.019 [0.075,0.037]0.0750.037[-0.075,0.037][ - 0.075 , 0.037 ] 0.9500.9500.9500.950
Community-based -- knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained 0.1060.1060.1060.106 [0.049,0.162]0.0490.162[0.049,0.162][ 0.049 , 0.162 ] 6×1076superscript1076\times 10^{-7}6 × 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT

Appendix S8 Optimizing node weights

In this section, we analyze the effect of optimizing node weights in additional to optimizing node selection.

S8.1 An example

In Fig. 1in the main text, we demonstrate our algorithm on sentinel node sets of size n{1,2,3,4}𝑛1234n\in\{1,2,3,4\}italic_n ∈ { 1 , 2 , 3 , 4 } for the coupled double-well dynamics on the dolphin network. In Fig. S14, we show a similar example when we also optimize node weights. When we use n{2,3,4}𝑛234n\in\{2,3,4\}italic_n ∈ { 2 , 3 , 4 }, the additional optimization step—that is, optimizing node weights in addition to the combinatorial optimization—allows us to even more closely approximate the averaged network activity, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG.

Refer to caption
Figure S14: Approximation of the average state, x¯¯𝑥\overline{x}over¯ start_ARG italic_x end_ARG, of the coupled double-well dynamics on the dolphin network when we also optimize node weights. (a) n=1𝑛1n=1italic_n = 1. (b) n=2𝑛2n=2italic_n = 2. (c) n=3𝑛3n=3italic_n = 3. (d) n=4𝑛4n=4italic_n = 4. The tan lines represent the weight-optimized approximation. The purple lines represent the corresponding approximation when we do not optimize node weights.

S8.2 Effect of node set type on approximation error

To quantify performances of the optimized node sets in which we also optimize the node weights, we built an ANOVA model using lnε𝜀\ln\varepsilonroman_ln italic_ε as the dependent variable, as we did in section S4. We generated 100 weight-optimized node sets for each pair of dynamics and network. For comparison, we use the same completely random and combinatorially optimized node sets (i.e., without optimizing node weights) from section S4. The three independent variables were dynamics, network, and node set type (random, optimized, and weight-optimized). All three independent variables were significant (dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=5436.06𝐹5436.06F=5436.06italic_F = 5436.06, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; network: df=8𝑑𝑓8df=8italic_d italic_f = 8, F=155.76𝐹155.76F=155.76italic_F = 155.76, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=2𝑑𝑓2df=2italic_d italic_f = 2, F=12362.80𝐹12362.80F=12362.80italic_F = 12362.80, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT).

We computed the differences between the three node set types with Tukey’s honestly significant difference test. We show the results in Table S4. As described in the main text, the average lnε𝜀\ln\varepsilonroman_ln italic_ε for the weight-optimized node sets is small, i.e., 1/e3.832=46.161superscript𝑒3.83246.161/e^{-3.832}=46.161 / italic_e start_POSTSUPERSCRIPT - 3.832 end_POSTSUPERSCRIPT = 46.16 times smaller than the optimized node sets without the additional weight optimization step.

Table S4: Differences between node set types in terms of average lnε𝜀\ln\varepsilonroman_ln italic_ε when we optimize node weights and know the test dynamics. CI: confidence interval.
Difference CI p𝑝pitalic_p
Optimized -- Random 5.625-5.625-5.625- 5.625 [5.767,5.483]5.7675.483[-5.767,-5.483][ - 5.767 , - 5.483 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Weight-optimized -- Random 9.456-9.456-9.456- 9.456 [9.598,9.315]9.5989.315[-9.598,-9.315][ - 9.598 , - 9.315 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Weight-optimized -- Optimized 3.832-3.832-3.832- 3.832 [3.973,3.690]3.9733.690[-3.973,-3.690][ - 3.973 , - 3.690 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT

S8.3 Transfer learning

In this section, we investigate performances of weight-optimized node sets when one does not know the test dynamics. As in section S7, we generated 100 weight-optimized node sets optimized with a training dynamics and evaluated the approximation error of those node sets on a test dynamics. We did this for each combination of training dynamics, test dynamics, and network. For comparison, we include the completely random and optimized node sets generated in the analysis in section S7.

All of the independent variables of the constructed ANOVA model are significant (training dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=1353.61𝐹1353.61F=1353.61italic_F = 1353.61, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; test dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=11278.18𝐹11278.18F=11278.18italic_F = 11278.18, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; network: df=8𝑑𝑓8df=8italic_d italic_f = 8, F=285.22𝐹285.22F=285.22italic_F = 285.22, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=2𝑑𝑓2df=2italic_d italic_f = 2, F=5171.33𝐹5171.33F=5171.33italic_F = 5171.33, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT), and the ANOVA model fits the data well (R2=0.61superscript𝑅20.61R^{2}=0.61italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.61). As reported in the main text, there is no significant difference between weight-optimized node sets and our standard optimized node sets (i.e., without node-weight optimization) when we optimize on a dynamics that is not the test dynamics (p=0.591𝑝0.591p=0.591italic_p = 0.591).

Table S5: Differences between node set types in terms of average lnε𝜀\ln\varepsilonroman_ln italic_ε when we optimize node weights and do not know the test dynamics. CI: confidence interval.
Difference CI p𝑝pitalic_p
Optimized -- Random 1.748-1.748-1.748- 1.748 [1.794,1.701]1.7941.701[-1.794,-1.701][ - 1.794 , - 1.701 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Weight-optimized -- Random 1.728-1.728-1.728- 1.728 [1.775,1.682]1.7751.682[-1.775,-1.682][ - 1.775 , - 1.682 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Weight-optimized -- Optimized 0.0190.0190.0190.019 [0.027,0.066]0.0270.066[-0.027,0.066][ - 0.027 , 0.066 ] 0.5910.5910.5910.591

Appendix S9 Sentinel node set optimization on weighted networks

In the main text, we focused on unweighted networks (i.e., networks in which all the edges have the same weight equal to 1111). In this section, we show that our sentinel node approximation also performs well for weighted networks.

We collected ten weighted networks from online repositories. From the KONECT project [61], we used a dominance network of Japanese macaques [76] (N=62𝑁62N=62italic_N = 62, M=1167𝑀1167M=1167italic_M = 1167), a friendship network of high school students in Illinois, USA [77] (N=70𝑁70N=70italic_N = 70, M=274𝑀274M=274italic_M = 274), a friendship network of Australian university students [78] (N=217𝑁217N=217italic_N = 217, M=1839𝑀1839M=1839italic_M = 1839), and the proximity network from the main text but with edge weights retained [62] (N=410𝑁410N=410italic_N = 410, M=2765𝑀2765M=2765italic_M = 2765). From the Netzschleuder repository [79], we used a network of interpersonal contacts between windsurfers [80] (N=43𝑁43N=43italic_N = 43, M=336𝑀336M=336italic_M = 336), a network of contacts between individuals involved in the train bombing in 2004 in Madrid, Spain [81] (N=64𝑁64N=64italic_N = 64, M=243𝑀243M=243italic_M = 243), a network of drug interactions gathered from health records in Blumenau, Brazil [82] (N=75𝑁75N=75italic_N = 75, M=181𝑀181M=181italic_M = 181), a coauthorship network [83] (N=379𝑁379N=379italic_N = 379, M=914𝑀914M=914italic_M = 914)), a neuronal network of Caenorhabditis elegans [84] (N=460𝑁460N=460italic_N = 460, M=1432𝑀1432M=1432italic_M = 1432), and a product export network [85] (N=774𝑁774N=774italic_N = 774, M=1779𝑀1779M=1779italic_M = 1779). As we did for unweighted networks in the main text, we coerced each network to be undirected, discarded any self-edges, and analyzed only the largest connected component.

As before, we simulated each dynamics on each network, using the parameters described in the main text, obtaining xi,superscriptsubscript𝑥𝑖x_{i,\ell}^{*}italic_x start_POSTSUBSCRIPT italic_i , roman_ℓ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT {1L}for-all1𝐿\forall\ell\in\{1\ldots L\}∀ roman_ℓ ∈ { 1 … italic_L }. We then selected 100 node sets of size n=lnN𝑛𝑁n=\lfloor\ln N\rflooritalic_n = ⌊ roman_ln italic_N ⌋ of each type.

We show the approximation error for the coupled double-well dynamics on the ten weighted networks in Fig. S15. As was the case for unweighted networks, our optimization algorithm consistently (i.e., over 100 independent algorithm runs and for each network) finds sentinel node sets with lower approximation error than degree-preserving and completely random node sets.

Refer to caption
Figure S15: Approximation error on weighted networks for the coupled double-well dynamics for node sets with n=lnN𝑛𝑁n=\lfloor\ln N\rflooritalic_n = ⌊ roman_ln italic_N ⌋ nodes. We show the results for optimized (blue circles), degree-preserving (orange triangles), and completely random (green squares) node sets on all networks. Each marker represents the approximation error for one node set. The different panels show the results for different networks. (a) Windsurfer contact (with N=43𝑁43N=43italic_N = 43 nodes). (b) Macaque dominance (N=62𝑁62N=62italic_N = 62). (c) Terrorist contact (N=64𝑁64N=64italic_N = 64). (d) High school friendship (N=70𝑁70N=70italic_N = 70). (e) Drug interaction (N=75𝑁75N=75italic_N = 75). (f) University friendship (N=217𝑁217N=217italic_N = 217). (g) Coauthorship (N=379𝑁379N=379italic_N = 379). (h) Weighted proximity (N=410𝑁410N=410italic_N = 410). (i) Neuronal (N=460𝑁460N=460italic_N = 460). (j) Export (N=774𝑁774N=774italic_N = 774).

To systematically verify that the sentinel node approximation performs better than the other node set selection methods, we ran the same statistical analysis as the one we used for unweighted networks (see section S4). The ANOVA results were similar to those for unweighted networks (model R2=0.73superscript𝑅20.73R^{2}=0.73italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.73; dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=10300𝐹10300F=10300italic_F = 10300, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; network: df=9𝑑𝑓9df=9italic_d italic_f = 9, F=966.3𝐹966.3F=966.3italic_F = 966.3, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=6𝑑𝑓6df=6italic_d italic_f = 6, F=6243𝐹6243F=6243italic_F = 6243, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT). The optimized node sets had an average approximation error that was 960 times smaller than completely random node sets (b=6.867𝑏6.867b=-6.867italic_b = - 6.867, eb=0.001superscript𝑒𝑏0.001e^{b}=0.001italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.001, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT) and 262 times smaller than degree-preserving node sets (difference in coefficients: 5.5695.569-5.569- 5.569, eb=0.004superscript𝑒𝑏0.004e^{b}=0.004italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.004, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT); see Table S6. We note that the ordering of node set types in terms of approximation error is similar to that for unweighted networks, except that there is less difference between the heuristic node sets and the completely random node sets in the case of weighted networks. For example, both the k𝑘kitalic_k-constrained and community-based node sets are not significantly different from the completely random node sets.

Table S6: Differences between node set types in terms of average lnε𝜀\ln\varepsilonroman_ln italic_ε when the network has edge weights and we know the test dynamics. CI: confidence interval.
Difference CI p𝑝pitalic_p
Optimized -- Random 6.867-6.867-6.867- 6.867 [6.999,6.735]6.9996.735[-6.999,-6.735][ - 6.999 , - 6.735 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Random 1.298-1.298-1.298- 1.298 [1.430,1.166]1.4301.166[-1.430,-1.166][ - 1.430 , - 1.166 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Random 0.093-0.093-0.093- 0.093 [0.225,0.039]0.2250.039[-0.225,0.039][ - 0.225 , 0.039 ] 0.3700.3700.3700.370
k𝑘kitalic_k-quantiled -- Random 0.407-0.407-0.407- 0.407 [0.539,0.275]0.5390.275[-0.539,-0.275][ - 0.539 , - 0.275 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Random 0.196-0.196-0.196- 0.196 [0.328,0.064]0.3280.064[-0.328,-0.064][ - 0.328 , - 0.064 ] 2.36×1042.36superscript1042.36\times 10^{-4}2.36 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
Community-based -- Random 0.100-0.100-0.100- 0.100 [0.232,0.032]0.2320.032[-0.232,0.032][ - 0.232 , 0.032 ] 0.2830.2830.2830.283
Degree-preserving -- Optimized 5.5695.5695.5695.569 [5.437,5.701]5.4375.701[5.437,5.701][ 5.437 , 5.701 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Optimized 6.7746.7746.7746.774 [6.642,6.906]6.6426.906[6.642,6.906][ 6.642 , 6.906 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Optimized 6.4606.4606.4606.460 [6.328,6.592]6.3286.592[6.328,6.592][ 6.328 , 6.592 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Optimized 6.6706.6706.6706.670 [6.538,6.802]6.5386.802[6.538,6.802][ 6.538 , 6.802 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Optimized 6.7676.7676.7676.767 [6.635,6.899]6.6356.899[6.635,6.899][ 6.635 , 6.899 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Degree-preserving 1.2051.2051.2051.205 [1.073,1.337]1.0731.337[1.073,1.337][ 1.073 , 1.337 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Degree-preserving 0.8910.8910.8910.891 [0.759,1.023]0.7591.023[0.759,1.023][ 0.759 , 1.023 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Degree-preserving 1.1011.1011.1011.101 [0.969,1.233]0.9691.233[0.969,1.233][ 0.969 , 1.233 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Degree-preserving 1.1981.1981.1981.198 [1.066,1.330]1.0661.330[1.066,1.330][ 1.066 , 1.330 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- k𝑘kitalic_k-constrained 0.314-0.314-0.314- 0.314 [0.446,0.182]0.4460.182[-0.446,-0.182][ - 0.446 , - 0.182 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-constrained 0.104-0.104-0.104- 0.104 [0.236,0.028]0.2360.028[-0.236,0.028][ - 0.236 , 0.028 ] 0.2380.2380.2380.238
Community-based -- k𝑘kitalic_k-constrained 0.007-0.007-0.007- 0.007 [0.139,0.125]0.1390.125[-0.139,0.125][ - 0.139 , 0.125 ] >0.999absent0.999>0.999> 0.999
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-quantiled 0.2100.2100.2100.210 [0.078,0.343]0.0780.343[0.078,0.343][ 0.078 , 0.343 ] 5.41×1055.41superscript1055.41\times 10^{-5}5.41 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT
Community-based -- k𝑘kitalic_k-quantiled 0.3070.3070.3070.307 [0.175,0.439]0.1750.439[0.175,0.439][ 0.175 , 0.439 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained 0.0970.0970.0970.097 [0.035,0.229]0.0350.229[-0.035,0.229][ - 0.035 , 0.229 ] 0.3180.3180.3180.318

The performance of our algorithm on weighted networks is also similar to that on unweighted networks when we do not know the test dynamics. In fact, by running the same analysis as that for the unweighted networks (see section S7), we find similar ANOVA results (R2=0.68superscript𝑅20.68R^{2}=0.68italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.68; training dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=4440𝐹4440F=4440italic_F = 4440, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; test dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=38634𝐹38634F=38634italic_F = 38634, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; network: df=9𝑑𝑓9df=9italic_d italic_f = 9, F=3490𝐹3490F=3490italic_F = 3490, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=6𝑑𝑓6df=6italic_d italic_f = 6, F=2391𝐹2391F=2391italic_F = 2391, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT). The average approximation error for the optimized node sets is 9.67 times smaller than for completely random node sets (b=2.269𝑏2.269b=-2.269italic_b = - 2.269, eb=0.096superscript𝑒𝑏0.096e^{b}=0.096italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.096, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; Table S7) and 3.66 times smaller than degree-preserving node sets (difference in coefficients: 1.2971.297-1.297- 1.297, eb=0.273superscript𝑒𝑏0.273e^{b}=0.273italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.273, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT).

Table S7: Differences between node set types in terms of average lnε𝜀\ln\varepsilonroman_ln italic_ε when the network has edge weights and we do not know the test dynamics. CI: confidence interval.
Difference CI p𝑝pitalic_p
Optimized -- Random 2.269-2.269-2.269- 2.269 [2.339,2.199]2.3392.199[-2.339,-2.199][ - 2.339 , - 2.199 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Random 0.972-0.972-0.972- 0.972 [1.042,0.903]1.0420.903[-1.042,-0.903][ - 1.042 , - 0.903 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Random 0.093-0.093-0.093- 0.093 [0.162,0.023]0.1620.023[-0.162,-0.023][ - 0.162 , - 0.023 ] 1.63×1031.63superscript1031.63\times 10^{-3}1.63 × 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Random 0.407-0.407-0.407- 0.407 [0.476,0.337]0.4760.337[-0.476,-0.337][ - 0.476 , - 0.337 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Random 0.196-0.196-0.196- 0.196 [0.266,0.127]0.2660.127[-0.266,-0.127][ - 0.266 , - 0.127 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Random 0.100-0.100-0.100- 0.100 [0.169,0.030]0.1690.030[-0.169,-0.030][ - 0.169 , - 0.030 ] 4.79×1044.79superscript1044.79\times 10^{-4}4.79 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
Degree-preserving -- Optimized 1.2971.2971.2971.297 [1.227,1.366]1.2271.366[1.227,1.366][ 1.227 , 1.366 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Optimized 2.1762.1762.1762.176 [2.107,2.246]2.1072.246[2.107,2.246][ 2.107 , 2.246 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Optimized 1.8621.8621.8621.862 [1.793,1.932]1.7931.932[1.793,1.932][ 1.793 , 1.932 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Optimized 2.0732.0732.0732.073 [2.003,2.142]2.0032.142[2.003,2.142][ 2.003 , 2.142 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Optimized 2.1692.1692.1692.169 [2.100,2.239]2.1002.239[2.100,2.239][ 2.100 , 2.239 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Degree-preserving 0.8790.8790.8790.879 [0.810,0.949]0.8100.949[0.810,0.949][ 0.810 , 0.949 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Degree-preserving 0.5650.5650.5650.565 [0.496,0.635]0.4960.635[0.496,0.635][ 0.496 , 0.635 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Degree-preserving 0.7760.7760.7760.776 [0.706,0.845]0.7060.845[0.706,0.845][ 0.706 , 0.845 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Degree-preserving 0.8730.8730.8730.873 [0.803,0.942]0.8030.942[0.803,0.942][ 0.803 , 0.942 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- k𝑘kitalic_k-constrained 0.314-0.314-0.314- 0.314 [0.384,0.245]0.3840.245[-0.384,-0.245][ - 0.384 , - 0.245 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-constrained 0.104-0.104-0.104- 0.104 [0.173,0.034]0.1730.034[-0.173,-0.034][ - 0.173 , - 0.034 ] 2.24×1042.24superscript1042.24\times 10^{-4}2.24 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
Community-based -- k𝑘kitalic_k-constrained 0.007-0.007-0.007- 0.007 [0.076,0.063]0.0760.063[-0.076,0.063][ - 0.076 , 0.063 ] >0.999absent0.999>0.999> 0.999
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-quantiled 0.2100.2100.2100.210 [0.141,0.280]0.1410.280[0.141,0.280][ 0.141 , 0.280 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- k𝑘kitalic_k-quantiled 0.3070.3070.3070.307 [0.238,0.377]0.2380.377[0.238,0.377][ 0.238 , 0.377 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained 0.0970.0970.0970.097 [0.027,0.166]0.0270.166[0.027,0.166][ 0.027 , 0.166 ] 8.04×1048.04superscript1048.04\times 10^{-4}8.04 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT

Appendix S10 Sentinel node set optimization on directed networks

In this section, we show that our sentinel node approximation also performs well for directed networks. For simplicity, we consider the largest weakly connected component of each network and do not consider networks with edge weights.

We collected ten directed networks from the Netzschleuder repository [79]: a freshwater trophic network [86] (N=109𝑁109N=109italic_N = 109, M=717𝑀717M=717italic_M = 717), a social network of physicians [87] (N=117𝑁117N=117italic_N = 117, M=542𝑀542M=542italic_M = 542), an email network from a manufacturing company [88] (N=167𝑁167N=167italic_N = 167, M=5783𝑀5783M=5783italic_M = 5783), a dependency network of the Flamingo software [89] (N=228𝑁228N=228italic_N = 228, M=497𝑀497M=497italic_M = 497), a transcription network of the bacterium Escherichia coli [90] (N=328𝑁328N=328italic_N = 328, M=456𝑀456M=456italic_M = 456), a transcription network of the yeast Saccharomyces cerevisiae [91] (N=664𝑁664N=664italic_N = 664, M=1066𝑀1066M=1066italic_M = 1066), a network of US air carrier flights in 2020 [92] (N=806𝑁806N=806italic_N = 806, M=11924𝑀11924M=11924italic_M = 11924), a dependency network of the JUNG software [89] (N=879𝑁879N=879italic_N = 879, M=2051𝑀2051M=2051italic_M = 2051), the directed version of the university email network from the main text [65] (N=1133𝑁1133N=1133italic_N = 1133, M=10902𝑀10902M=10902italic_M = 10902), and a network of air routes preferred by the US Federal Aviation Administration [93] (N=1226𝑁1226N=1226italic_N = 1226, M=2613𝑀2613M=2613italic_M = 2613). We discarded any self-loops and the multiplicity of any multi-edges.

We show the approximation error for the coupled double-well dynamics on the ten directed networks in Fig. S16. The figure indicates that our optimization algorithm reliably identifies node sets that obtain relatively small approximation error in each of the ten directed networks.

Refer to caption
Figure S16: Approximation error for the coupled double-well dynamics on directed networks with n=lnN𝑛𝑁n=\lfloor\ln N\rflooritalic_n = ⌊ roman_ln italic_N ⌋ nodes. (a) Trophic. (b) Physicians. (c) Email (manufacturer). (d) Flamingo. (e) Escherichia coli. (f) Saccharomyces cerevisiae. (g) Flights. (h) JUNG. (i) Email (university). (j) Air route.

We verified our observations with an ANOVA (model R2=0.60superscript𝑅20.60R^{2}=0.60italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.60; dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=5048𝐹5048F=5048italic_F = 5048, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; network: df=9𝑑𝑓9df=9italic_d italic_f = 9, F=1925𝐹1925F=1925italic_F = 1925, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=6𝑑𝑓6df=6italic_d italic_f = 6, F=1541𝐹1541F=1541italic_F = 1541, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT). On average, the approximation error for optimized node sets was 211 times smaller than completely random node sets (b=5.351𝑏5.351b=-5.351italic_b = - 5.351, eb=0.005superscript𝑒𝑏0.005e^{b}=0.005italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.005, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT) and 62.4 times smaller than degree-preserving node sets (difference in coefficients: 4.1344.134-4.134- 4.134, eb=0.016superscript𝑒𝑏0.016e^{b}=0.016italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.016, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT); see Table S8.

Table S8: Differences between node set types in terms of average lnε𝜀\ln\varepsilonroman_ln italic_ε when the edges are directed and we know the test dynamics. CI: confidence interval.
Difference CI p𝑝pitalic_p
Optimized -- Random 5.351-5.351-5.351- 5.351 [5.563,5.139]5.5635.139[-5.563,-5.139][ - 5.563 , - 5.139 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Random 1.217-1.217-1.217- 1.217 [1.429,1.005]1.4291.005[-1.429,-1.005][ - 1.429 , - 1.005 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Random 0.081-0.081-0.081- 0.081 [0.293,0.131]0.2930.131[-0.293,0.131][ - 0.293 , 0.131 ] 0.921
k𝑘kitalic_k-quantiled -- Random 0.362-0.362-0.362- 0.362 [0.574,0.150]0.5740.150[-0.574,-0.150][ - 0.574 , - 0.150 ] 1.00×1051.00superscript1051.00\times 10^{-5}1.00 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Random 0.045-0.045-0.045- 0.045 [0.257,0.167]0.2570.167[-0.257,0.167][ - 0.257 , 0.167 ] 0.996
Community-based -- Random 0.3470.3470.3470.347 [0.134,0.561]0.1340.561[0.134,0.561][ 0.134 , 0.561 ] 3.19×1053.19superscript1053.19\times 10^{-5}3.19 × 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT
Degree-preserving -- Optimized 4.1344.1344.1344.134 [3.922,4.346]3.9224.346[3.922,4.346][ 3.922 , 4.346 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Optimized 5.2705.2705.2705.270 [5.058,5.482]5.0585.482[5.058,5.482][ 5.058 , 5.482 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Optimized 4.9894.9894.9894.989 [4.777,5.201]4.7775.201[4.777,5.201][ 4.777 , 5.201 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Optimized 5.3065.3065.3065.306 [5.094,5.518]5.0945.518[5.094,5.518][ 5.094 , 5.518 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Optimized 5.6995.6995.6995.699 [5.486,5.912]5.4865.912[5.486,5.912][ 5.486 , 5.912 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Degree-preserving 1.1361.1361.1361.136 [0.924,1.348]0.9241.348[0.924,1.348][ 0.924 , 1.348 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Degree-preserving 0.8550.8550.8550.855 [0.643,1.067]0.6431.067[0.643,1.067][ 0.643 , 1.067 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Degree-preserving 1.1721.1721.1721.172 [0.960,1.384]0.9601.384[0.960,1.384][ 0.960 , 1.384 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Degree-preserving 1.5641.5641.5641.564 [1.351,1.778]1.3511.778[1.351,1.778][ 1.351 , 1.778 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- k𝑘kitalic_k-constrained 0.281-0.281-0.281- 0.281 [0.493,0.069]0.4930.069[-0.493,-0.069][ - 0.493 , - 0.069 ] 1.80×1031.80superscript1031.80\times 10^{-3}1.80 × 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-constrained 0.0360.0360.0360.036 -[0.176,0.248]0.1760.248[0.176,0.248][ 0.176 , 0.248 ] 0.999
Community-based -- k𝑘kitalic_k-constrained 0.4280.4280.4280.428 [0.215,0.641]0.2150.641[0.215,0.641][ 0.215 , 0.641 ] 107superscript10710^{-7}10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-quantiled 0.3170.3170.3170.317 [0.105,0.529]0.1050.529[0.105,0.529][ 0.105 , 0.529 ] 2.15×1042.15superscript1042.15\times 10^{-4}2.15 × 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT
Community-based -- k𝑘kitalic_k-quantiled 0.7090.7090.7090.709 [0.496,0.922]0.4960.922[0.496,0.922][ 0.496 , 0.922 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained 0.3930.3930.3930.393 [0.179,0.606]0.1790.606[0.179,0.606][ 0.179 , 0.606 ] 1.2×1061.2superscript1061.2\times 10^{-6}1.2 × 10 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT

The ANOVA for directed networks in the case when we do not know the test dynamics fits the data reasonably well (model R2=0.53superscript𝑅20.53R^{2}=0.53italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 0.53; training dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=1384𝐹1384F=1384italic_F = 1384, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; test dynamics: df=3𝑑𝑓3df=3italic_d italic_f = 3, F=10888𝐹10888F=10888italic_F = 10888, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; network: df=9𝑑𝑓9df=9italic_d italic_f = 9, F=6397𝐹6397F=6397italic_F = 6397, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT; node set type: df=6𝑑𝑓6df=6italic_d italic_f = 6, F=261𝐹261F=261italic_F = 261, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT). As is the case for undirected networks, the performance of optimized node sets is worse when the test dynamics is different from the training dynamics, but optimized node sets still outperform all other node sets. Optimized node sets had on average 3.02 times lower approximation error than completely random node sets (b=1.104𝑏1.104b=-1.104italic_b = - 1.104, eb=0.332superscript𝑒𝑏0.332e^{b}=0.332italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.332, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT) and 1.81 times smaller error than degree-preserving node sets (difference in coefficients: 0.5950.595-0.595- 0.595, eb=0.552superscript𝑒𝑏0.552e^{b}=0.552italic_e start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT = 0.552, p<107𝑝superscript107p<10^{-7}italic_p < 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT); see Table S9.

Table S9: Differences between node set types in terms of average lnε𝜀\ln\varepsilonroman_ln italic_ε when the edges are directed and we do not know the test dynamics. CI: confidence interval.
Difference CI p𝑝pitalic_p
Optimized -- Random 1.104-1.104-1.104- 1.104 [1.224,0.984]1.2240.984[-1.224,-0.984][ - 1.224 , - 0.984 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Random 0.508-0.508-0.508- 0.508 [0.628,0.388]0.6280.388[-0.628,-0.388][ - 0.628 , - 0.388 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Random 0.081-0.081-0.081- 0.081 [0.201,0.039]0.2010.039[-0.201,0.039][ - 0.201 , 0.039 ] 0.422
k𝑘kitalic_k-quantiled -- Random 0.362-0.362-0.362- 0.362 [0.482,0.242]0.4820.242[-0.482,-0.242][ - 0.482 , - 0.242 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Random 0.045-0.045-0.045- 0.045 [0.165,0.075]0.1650.075[-0.165,0.075][ - 0.165 , 0.075 ] 0.925
Community-based -- Random 0.3510.3510.3510.351 [0.231,0.472]0.2310.472[0.231,0.472][ 0.231 , 0.472 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Degree-preserving -- Optimized 0.5950.5950.5950.595 [0.475,0.715]0.4750.715[0.475,0.715][ 0.475 , 0.715 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Optimized 1.0231.0231.0231.023 [0.903,1.143]0.9031.143[0.903,1.143][ 0.903 , 1.143 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Optimized 0.7420.7420.7420.742 [0.622,0.862]0.6220.862[0.622,0.862][ 0.622 , 0.862 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Optimized 1.0581.0581.0581.058 [0.939,1.178]0.9391.178[0.939,1.178][ 0.939 , 1.178 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Optimized 1.4551.4551.4551.455 [1.334,1.575]1.3341.575[1.334,1.575][ 1.334 , 1.575 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-constrained -- Degree-preserving 0.4270.4270.4270.427 [0.308,0.547]0.3080.547[0.308,0.547][ 0.308 , 0.547 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- Degree-preserving 0.1460.1460.1460.146 [0.027,0.266]0.0270.266[0.027,0.266][ 0.027 , 0.266 ] 5.86×1035.86superscript1035.86\times 10^{-3}5.86 × 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- Degree-preserving 0.4630.4630.4630.463 [0.343,0.583]0.3430.583[0.343,0.583][ 0.343 , 0.583 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- Degree-preserving 0.8590.8590.8590.859 [0.739,0.980]0.7390.980[0.739,0.980][ 0.739 , 0.980 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
k𝑘kitalic_k-quantiled -- k𝑘kitalic_k-constrained 0.281-0.281-0.281- 0.281 [0.401,0.161]0.4010.161[-0.401,-0.161][ - 0.401 , - 0.161 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-constrained 0.0360.0360.0360.036 [0.084,0.156]0.0840.156[-0.084,0.156][ - 0.084 , 0.156 ] 0.976
Community-based -- k𝑘kitalic_k-constrained 0.4320.4320.4320.432 [0.311,0.552]0.3110.552[0.311,0.552][ 0.311 , 0.552 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained -- k𝑘kitalic_k-quantiled 0.3170.3170.3170.317 [0.197,0.437]0.1970.437[0.197,0.437][ 0.197 , 0.437 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- k𝑘kitalic_k-quantiled 0.7130.7130.7130.713 [0.592,0.834]0.5920.834[0.592,0.834][ 0.592 , 0.834 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
Community-based -- knnsubscript𝑘nnk_{\rm nn}italic_k start_POSTSUBSCRIPT roman_nn end_POSTSUBSCRIPT-constrained 0.3960.3960.3960.396 [0.276,0.517]0.2760.517[0.276,0.517][ 0.276 , 0.517 ] <107absentsuperscript107<10^{-7}< 10 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT