Detecting and Mitigating Bias in Algorithms Used to Disseminate Information in Social Networks

Vedran Sekara,¹∗† Ivan Dotu,²∗† Manuel Cebrian,³,⁴
Esteban Moro,⁵ and Manuel Garcia-Herranz⁶∗

¹IT University of Copenhagen, 2300 Copenhagen, Denmark
²UNICEF/GIGA, New York, NY, USA
³Department of Statistics, Universidad Carlos III de Madrid, Madrid, Spain
⁴Center for Automation and Robotics, Spanish National Research Council, Madrid, Spain
⁵Network Science Institute, Northeastern University, Boston, MA, USA
⁶UNICEF, New York, NY, USA

∗To whom correspondence should be addressed; E-mail: vsek@itu.dk, jdoturodriguez@unicef.org, and
mherranz@unicef.org.
†Contributed equally to the paper

Social connections are a conduit through which individuals communicate, information propagates, and diseases spread. Identifying individuals who are more likely to adopt ideas or technologies and spread them to others is essential for developing effective information campaigns, fighting epidemics, and maximizing the reach of limited resources. Consequently, a lot of work has focused on identifying sets of influencers. Here we show that seeding information using these influence maximization methods only benefits connected and central individuals, consistently leaving the most vulnerable behind. Our results highlight troublesome outcomes of influence maximization algorithms: they do not disseminate information in an equitable manner, threatening to create an increasingly unequal society. To overcome this issue we devise a simple, multi-objective algorithm, which maximises both influence and information equity. Our work demonstrates how to find fairer influencer sets, highlighting that in our search for maximizing information, we do not need to compromise on information equality.

Introduction

Social relationships serve as important vectors through which a multitude of behaviors spread, from health-related behaviors (?, ?), innovation (?, ?), micro-financing decisions (?), decisions to insure (?), happiness (?), and cultural tastes (?), to the emergence of social movements (?, ?). Knowing through which pathways information spreads is vital for international development (?, ?) and crucial in developing more efficient methodologies that maximize the diffusion of potentially life-saving information (?, ?). Due to resource constraints it is infeasible to send a piece of information to all individuals within a network. Instead, a frequently adopted strategy is to seed a small set of individuals, much smaller than the full population, located at strategic places in the network, whose activation (or removal) would facilitate the spread of information (or, in the case of epidemics, inhibit a disease from spreading). Numerous methods have been proposed to identify "the set" of influential nodes. These methodologies can be divided into two fundamentally distinct classes (?, ?): (1) methods that identify individuals who are highly connected or central, and effective at diffusing information, so-called superspreader methods (?, ?, ?, ?, ?, ?, ?); and (2) methods that highlight individuals who occupy structurally vital positions in a network, whose removal would destroy the network and subsequently block information from propagating, also called superblocker methods (?, ?, ?, ?, ?, ?). Although the methods are different, there is a general consensus that they both pinpoint nodes which are highly efficient conduits for information propagation (?).

Unfortunately, we have a limited understanding of which demographics these methods reach and which communities they leave behind, but there are alarming signs in the network science literature. Previous work has demonstrated that wealth is connected to the diversity of social connections: the richer an individual is, the more diverse their social network, which in turn enables them to receive information from many directions (?, ?, ?). Similarly, influence maximisation has been shown to create gaps in information access (?), and to have a gender skew, with male individuals having an advantage in being selected as influencers (?, ?, ?). Previous work has also shown that an individual's chances of being ranked as an influencer are highly correlated with personal economic status (?). Further, as social systems display strong levels of homophily (?), where connections between similar individuals occur at higher rates than between dissimilar individuals, wealthy individuals are more likely to befriend people who resemble them, and less likely to interact with the poor (?). As a consequence, information in social networks tends to be localized within social strata, restricting diffusion across demographic and socioeconomic gaps and between majority and minority groups (?). Access to information is a major factor of social vulnerability (?). By naively using current influence maximisation methodologies, which select predominantly wealthy influencers through which information is inserted, we run the risk of tailoring information campaigns towards the most affluent groups of our societies, while severely under-representing the most vulnerable and marginalized.

Defining Informational Vulnerability

Vulnerability is a complex issue determined by physical, economic, social, and environmental factors, which decrease the capacity of individuals and groups to cope with, anticipate, and react to hazards (?). Here we study one aspect of vulnerability, namely individuals' access to information. Previous work has looked into the interplay between influence maximization and different characteristics of the nodes (such as age, gender, ethnicity) which receive information (?, ?, ?, ?). However, rather than assuming we have access to this type of information, we focus on the general utility of the information individuals receive; this has previously been called a 'welfare approach' (?, ?). As such, we study how access to information is distributed across all nodes in the network. Specifically, we define informational vulnerability as the average likelihood that an individual receives information, estimated from many independent diffusion cascades. We focus on two facets of this stochastic process: 1) frequency: how often does an individual receive information (i.e. how often are they reached by cascades), and 2) recency: how old are the cascades when they reach them (i.e. at what step of the cascade is the individual reached). Looking at just one cascade reveals very little, but over many independent simulations it becomes clear that not every individual is equally lucky in their network position: some tend to be reached by more cascades, or earlier, than others. Network structure and node locations in the network play very important roles in this. To simulate information diffusion processes on networks we apply the commonly used Independent Cascade Model (ICM) in its simplest form, unweighted and undirected (?). ICMs are commonly used to study influence maximisation in social networks (?, ?) and are a special case of susceptible-infected-recovered (SIR) models, where the recovery probability is fixed to 1.
The ICM allows an informed individual one attempt to speak to their neighbors and convince them to adopt a behavior; the neighbors, if convinced, will then try to convince their neighbors, and so on (see Supplementary Materials (SM) Sec. S3 for more details).
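As a minimal sketch of the process just described (not the implementation used for the results reported here), a single cascade of the unweighted, undirected ICM can be simulated as follows. The adjacency-dictionary format, the function name `independent_cascade`, and the single transmission probability `p` are illustrative assumptions.

```python
import random
from collections import deque


def independent_cascade(adj, seeds, p, rng):
    """Simulate one cascade of an unweighted, undirected ICM.

    adj   : dict mapping each node to an iterable of its neighbours
    seeds : iterable of initially informed nodes (activated at step 0)
    p     : probability that a single attempt along an edge succeeds
    rng   : random.Random instance (passed in for reproducibility)

    Returns a dict mapping every reached node to its activation step.
    """
    activation = {s: 0 for s in seeds}
    frontier = deque(activation)  # FIFO queue keeps nodes ordered by step
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            # An informed node gets one attempt per still-uninformed neighbour.
            if v not in activation and rng.random() < p:
                activation[v] = activation[u] + 1
                frontier.append(v)
    return activation


# Usage: a path graph 0-1-2-3 seeded at node 0.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(independent_cascade(path, [0], 0.5, random.Random(42)))
```

Because each node is removed from the queue after its single round of attempts, it never tries again, which corresponds to the recovery probability of 1 in the SIR picture.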

To identify sets of influencers we focus on four widely used methods: (i) highest degree (?) (HD), which is a commonly used heuristic where nodes are selected according to their number of connections, (ii) degree discount (?) (DD), an efficient method for identifying superspreaders, (iii) coreHD (?) (CHD), a state-of-the-art method for inferring superblockers, and (iv) k-core (?) (KC), selecting nodes located in the core of the network. Fig. 1a illustrates, for a small real-world network, which nodes each method pinpoints as influencers. As ICM is a stochastic process we average over multiple simulations. For each realisation of the dynamic process we track which nodes are activated and how long it takes for the spreading process to reach them. We quantify this using two measures. The first measure, information frequency:

$$\nu_{i}=\frac{1}{M}\sum_{n=1}^{M}a_{i,n},$$

summarizes the average fraction of times node $i$ has been reached, where $a_{i,n}=1$ if node $i$ received information in realisation $n$ and is zero otherwise, and $M$ is the total number of simulations. Information frequency ( $\nu_{i}$ ) lies in the interval 0 to 1, where zero indicates that a node is never reached by the $M$ cascades, while a value of 1 indicates the node is reached by every cascade.

The second measure, information recency:

$$\tau_{i}=\frac{1}{M}\sum_{n=1}^{M}\frac{1}{t_{i,n}+1},$$

quantifies the temporal delay from process initialization ( $t=0$ ) until node $i$ is activated. Recency is calculated as the average of the inverse activation time to handle cases where the information spreading process dies out before reaching a node. Nodes that on average receive information quickly have $t_{i,n}\rightarrow 0$ and $\tau\rightarrow 1$ , while nodes that are activated very late, or never, ( $t_{i,n}\rightarrow\infty$ ) have $\tau\rightarrow 0$ .
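The two measures can be computed directly from the outcomes of $M$ cascades. The sketch below assumes each cascade is stored as a dictionary mapping reached nodes to their activation step $t_{i,n}$, with unreached nodes simply absent; this storage format and the function name `frequency_and_recency` are illustrative choices, not part of the original method description.

```python
def frequency_and_recency(nodes, cascades):
    """Estimate information frequency (nu) and recency (tau) per node.

    nodes    : iterable of all node identifiers in the network
    cascades : list of M dicts, each mapping reached nodes to the step
               t_{i,n} at which they were activated (seeds have t = 0)

    A node that is not reached in a cascade contributes a_{i,n} = 0 to nu
    and 0 to tau (the t -> infinity limit of 1 / (t + 1)).
    """
    M = len(cascades)
    if M == 0:
        raise ValueError("at least one cascade is required")
    reached = {i: 0 for i in nodes}
    inverse_time = {i: 0.0 for i in nodes}
    for activation in cascades:
        for i, t in activation.items():
            reached[i] += 1
            inverse_time[i] += 1.0 / (t + 1)
    nu = {i: reached[i] / M for i in reached}
    tau = {i: inverse_time[i] / M for i in inverse_time}
    return nu, tau


# Usage: two cascades on three nodes; node 2 is never reached.
nu, tau = frequency_and_recency([0, 1, 2], [{0: 0, 1: 1}, {0: 0}])
print(nu)   # {0: 1.0, 1: 0.5, 2: 0.0}
print(tau)  # {0: 1.0, 1: 0.25, 2: 0.0}
```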

To understand the shortcomings of the four methods in selecting influencers we compare them to a benchmark model where all nodes have an equal chance of being selected (i.e. seeds are selected at random). We call the ratio between a method's value and the benchmark's value the effective measure. If the ratio $\nu_{i}^{\text{method}}/\nu_{i}^{\text{benchmark}}>1$ a node will on average receive information more frequently when seeds are selected using a specific method than when seeds are selected at random. If $\nu_{i}^{\text{method}}/\nu_{i}^{\text{benchmark}}<1$ a node will be better off when information is inserted at random entry points in the network. The same holds for recency: $\tau_{i}^{\text{method}}/\tau_{i}^{\text{benchmark}}>1$ indicates that a node on average receives more recent information when using an influence maximisation method, while $\tau_{i}^{\text{method}}/\tau_{i}^{\text{benchmark}}<1$ denotes that information is received faster when it is inserted at random nodes. Fig. 1b illustrates the resulting effective recency values from using influencers inferred by the four methods as seeds. Independent of methodology, the influencer nodes and their surrounding neighbors are always reached by the influencer set; however, a large fraction of nodes, typically those located on the periphery, seems to be left behind.
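A short sketch of this comparison, assuming per-node values (either $\nu$ or $\tau$) are held in dictionaries keyed by node. The function names are hypothetical, and skipping nodes whose benchmark value is zero (where the ratio is undefined) is an assumption made here for illustration.

```python
def effective_measure(method_values, benchmark_values):
    """Per-node ratio of a method's value to the random-seeding benchmark.

    Nodes with a benchmark value of zero are skipped, since the ratio is
    undefined for them (an assumption of this sketch).
    """
    ratios = {}
    for node, bench in benchmark_values.items():
        if bench > 0:
            ratios[node] = method_values.get(node, 0.0) / bench
    return ratios


def fraction_worse_off(ratios):
    """Share of nodes with an effective measure below 1."""
    if not ratios:
        return 0.0
    return sum(r < 1.0 for r in ratios.values()) / len(ratios)


# Usage with made-up frequency values for three nodes.
nu_method = {"a": 0.5, "b": 0.125, "c": 0.25}
nu_benchmark = {"a": 0.25, "b": 0.25, "c": 0.25}
ratios = effective_measure(nu_method, nu_benchmark)
print(ratios)                      # {'a': 2.0, 'b': 0.5, 'c': 1.0}
print(fraction_worse_off(ratios))  # one of three nodes is worse off
```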

Figure 1: Inequalities and diffusion of information in networks. a, Initial seed sets selected according to HD, CHD, DD, and KC, illustrating variations in how the four methods select influencers for a social network between households in a south-Indian village (?). In this example $5\%$ of nodes are selected as influencers. Colored nodes indicate selected influencers. b, Effective recency for the social network. Recency is estimated across 1000 runs with $p_{c}=0.069$ (see SM Sec. S3.1). c, Cumulative distribution of individual node frequency within synthetic scale-free (SF) networks with $N=10^{4}$, $\gamma=2.5$, and $\langle p_{c}\rangle=0.085$. The curves show the probability that $\nu$ is less than or equal to $x$, where $x$ is any arbitrary value. Results are combined over 100 different network realizations. For each network we select an initial influencer set ($1\%$ of nodes) inferred from one of the heuristics, run the spreading process, track which nodes receive information, and repeat the process $M=10N$ times to account for stochasticity in both the seed selection and the spreading processes. Red shaded regions denote parts of the distribution where the effective measure (the ratio) is below 1, while grey shaded regions indicate places where the ratio is above 1. d, Cumulative distribution of recency within SF networks. e, Fraction of nodes that are worse off with respect to information frequency in $n$ of the seeding heuristics when compared to the benchmark. Error bars show the standard deviation over 100 network realizations. f, Fraction of nodes that are worse off with respect to recency.

Quantifying Informational Vulnerability

To formalize the observations from Fig. 1b we first investigate the four methodologies on a testbed of 100 realizations of synthetic unweighted and undirected networks with scale-free (SF) degree distributions (see SM Sec. S6 for random networks with normal degree distributions). While perfect SF networks are rarely observed in nature they are powerful simplifications of real-world networks (?, ?). In order to compare heuristics we construct influencer sets from a fixed finite fraction of the network population—1% of nodes (see SM Sec. S5 for other seed sizes).
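For concreteness, two of the four seeding heuristics can be sketched in a few lines. The highest-degree rule simply ranks nodes by degree; the degree-discount rule below follows the commonly published formulation, in which the score of a node with degree $d_v$ and $t_v$ already-selected neighbours is $d_v - 2t_v - (d_v - t_v)\,t_v\,p$. The function names, the adjacency-dictionary format, and the tie-breaking by iteration order are assumptions of this sketch, not details taken from the paper.

```python
def highest_degree(adj, k):
    """Select the k nodes with the most connections (HD)."""
    return sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]


def degree_discount(adj, k, p):
    """Select k seeds with the degree-discount heuristic (DD).

    After a node is selected, each unselected neighbour v has its score
    lowered to d_v - 2 t_v - (d_v - t_v) * t_v * p, where t_v counts the
    neighbours of v that are already seeds and p is the transmission
    probability.
    """
    degree = {v: len(adj[v]) for v in adj}
    score = dict(degree)
    selected_neighbours = {v: 0 for v in adj}
    seeds, chosen = [], set()
    for _ in range(min(k, len(adj))):
        u = max((v for v in adj if v not in chosen), key=lambda v: score[v])
        seeds.append(u)
        chosen.add(u)
        for v in adj[u]:
            if v not in chosen:
                selected_neighbours[v] += 1
                t = selected_neighbours[v]
                score[v] = degree[v] - 2 * t - (degree[v] - t) * t * p
    return seeds


# Usage: a hub (0), a second hub attached to it (1), and a separate hub (7).
adj = {0: {1, 2, 3, 4}, 1: {0, 5, 6}, 2: {0}, 3: {0}, 4: {0},
       5: {1}, 6: {1}, 7: {8, 9}, 8: {7}, 9: {7}}
print(highest_degree(adj, 2))        # [0, 1]
print(degree_discount(adj, 2, 0.1))  # [0, 7]
```

In the toy graph the degree-discount rule avoids picking node 1, whose reach overlaps with the already selected node 0, and instead seeds the separate component.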

Seeding information through random nodes in SF networks results in a near-homogeneous frequency distribution, i.e. each node receives a constant stream of information (Fig. 1c, black line). In Fig. 1c, a completely equal information metric would be characterized by a vertical line in the cumulative probability distribution. Note that the black line of Fig. 1c, characterizing the random process, is not vertical due to the intrinsic variations of the network structure. Using the four methods to select influencer sets, however, results in fundamentally different frequency distributions (Fig. 1c, colored lines). Approximately half of the nodes have effective frequency values ( $\nu^{\text{method}}/\nu^{\text{benchmark}}$ ) above one, meaning they receive information more frequently than expected under a random process, while the other half of the network receives much less than when seeds are selected at random (see also SM Sec. S4). The situation is similar for the recency distribution, albeit slightly more polarized (Fig. 1d). Looking across the different influence maximisation heuristics, Fig. 1e demonstrates that if a node is under-informed by one method, it will most likely not be better reached by any other method. On average $42.8\pm 1.7\%$ of nodes are better informed when information is seeded using any influence maximization method. We say that these nodes are always better off. However, $47.0\pm 2.2\%$ of nodes are consistently left behind, independent of which method is used to select influencers. With respect to recency, Fig. 1f highlights an even worse situation: $55.6\pm 2.2\%$ of nodes receive out-of-date information, independent of which influence maximisation methodology is used.
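The bookkeeping behind such a cross-method comparison can be sketched as follows: for every node, count in how many heuristics its effective measure is below 1, then report the share of nodes for each count. The input format (one dictionary of per-node ratios per method) and the function name are assumptions of this illustration; the numbers in the usage example are made up.

```python
from collections import Counter


def worse_off_distribution(ratios_by_method):
    """Fraction of nodes that are worse off in exactly n methods.

    ratios_by_method : dict mapping a method name to a dict of per-node
                       effective measures (method value / benchmark value);
                       must contain at least one method.
    Only nodes present for every method are counted.
    """
    methods = list(ratios_by_method)
    nodes = set.intersection(*(set(ratios_by_method[m]) for m in methods))
    if not nodes:
        raise ValueError("no node has a ratio for every method")
    counts = Counter(
        sum(ratios_by_method[m][i] < 1.0 for m in methods) for i in nodes
    )
    return {n: counts.get(n, 0) / len(nodes) for n in range(len(methods) + 1)}


# Usage with two methods and three nodes.
example = {
    "HD": {"a": 0.5, "b": 2.0, "c": 0.5},
    "DD": {"a": 0.5, "b": 2.0, "c": 1.5},
}
dist = worse_off_distribution(example)
print(dist[0])  # share of nodes never worse off ("always better off")
print(dist[2])  # share of nodes worse off under both methods ("left behind")
```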

To understand the implications of information inequalities for real-world networks we look at communication, interaction, and social networks varying in size from hundreds to tens of thousands of individuals. The networks range in context from face-to-face encounters (?), connections between bloggers and blogs on political topics (?), digital communication between university students (?), and online friendships (?), to scientific collaborations (?) (see SM Sec. S2 for more details).

For information frequency, Fig. 2a shows that current state-of-the-art influencer heuristics identify seed sets that, on average, leave significant portions of the network in disadvantaged positions. The HD, CHD, and DD methods result in fairly similar $\nu$ distributions, while selecting influencers according to KC performs much worse (approx. 80% of nodes are better reached by randomly selected influencers). This is due to real-world networks having large numbers of cliques, where small discrepancies in shell numbers can result in KC only selecting nodes from a single clique (?), effectively limiting the diffusion of information. Similar behavior is observed for information recency ( $\tau$ ). Fig. 2b shows that HD, CHD, and DD leave behind a comparable number of nodes, while KC again performs worse. Summarizing the average reach of influencer heuristics, we find that up to $69.8\%$ of a network might receive information less frequently than if it were seeded at random (Fig. 2c), and information can reach up to $79.2\%$ of individuals more slowly (Fig. 2d).

There is a connection between access to information and a node's position in the network (?). The connection is so strong that a predictive model, based on structural features such as node degree, k-shell number, clustering, centrality, and eccentricity, can accurately predict whether a node will fall into the group of 'worse off according to all influencer heuristics' or 'better off in all' (see SM Sec. S7). On average, we can predict the information status of nodes with 97.4% accuracy for frequency and 96.9% accuracy for recency. This demonstrates that current influencer heuristics suffer from biases which disadvantage poorly connected and peripheral nodes.

Fair Influence Maximisation

To bridge the information gap, different strategies can be employed. Previous studies have shown that instead of using influencer algorithms to identify $s$ individuals, one can select slightly more individuals, $s+x$, but at random (?). Even for small values of $x$ this has been shown to result in larger cascades. Another solution is to apply acquaintance methods (also called friendship nomination), which work by selecting a random neighbor of a randomly selected node. These have been used to select influencers for maximising mass drug administration campaigns (?), to seed information about maternal and child health (?), and for inferring centrally located individuals suitable as monitors for detecting large-scale disease outbreaks (?). Lastly, maximization of social welfare objective functions, along with the implementation of different algorithms, has been proposed to bridge the information gap (?).
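The acquaintance (friendship-nomination) idea is simple enough to sketch directly: pick a node uniformly at random, then pick one of its neighbours uniformly at random as the seed. The function name, the adjacency-dictionary format, and the choice to sample until `k` distinct seeds are found are assumptions of this illustration rather than details of the cited studies.

```python
import random


def acquaintance_seeds(adj, k, rng):
    """Select up to k distinct seeds by friendship nomination.

    Each draw picks a uniformly random node that has at least one
    neighbour and then nominates one of its neighbours uniformly at
    random. Well-connected nodes are nominated more often because they
    appear in many neighbour lists.
    """
    eligible = [v for v in adj if adj[v]]  # nodes able to nominate someone
    # In an undirected network every eligible node can also be nominated,
    # so at most len(eligible) distinct seeds exist.
    k = min(k, len(eligible))
    seeds = set()
    while len(seeds) < k:
        nominator = rng.choice(eligible)
        seeds.add(rng.choice(list(adj[nominator])))
    return seeds


# Usage: in a star network the centre is nominated by every leaf.
star = {0: list(range(1, 51))}
star.update({leaf: [0] for leaf in range(1, 51)})
print(acquaintance_seeds(star, 1, random.Random(7)))
```

No global knowledge of the network is needed for a single nomination, which is one reason the approach is attractive for field campaigns.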

We embark on a different solution. Traditionally, influence maximisation has focused on maximising a single objective function, information spread. However, as literature from the field of Artificial Intelligence shows, focusing solely on optimizing one parameter can lead to troubling and unfair outcomes (?, ?, ?). Instead, we propose a multi-objective formulation of fair influence maximisation, where both spread and fairness are taken into account in the fitness of a candidate solution (i.e. the set of selected influencers). We measure fairness as the fraction of nodes that receive information at a higher frequency and speed than what is expected from the benchmark model, where influencers are selected at random. We say a node is vulnerable if it receives less information than what can be expected at random (i.e. $\nu_{i}^{\text{method}}/\nu_{i}^{\text{benchmark}}<1$ ). In the following we focus primarily on information frequency, but similar results can be obtained for recency. The fairer an influencer set is, the fewer nodes will be vulnerable, so we maximise the number of non-vulnerable nodes. Analytically calculating how information will spread from a set of nodes according to the ICM is a computationally hard problem (NP-hard) (?). We therefore use an approximation of the fitness of an influencer set (see SM Sec. S8). To find fairer seeds we then use a genetic algorithm to solve the optimization problem (see Methods).
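To make the two-objective formulation concrete, the sketch below scores a candidate seed set from per-node frequencies and then filters a list of scored candidates down to the non-dominated ones. It is a generic illustration, not the genetic algorithm or the fitness approximation referred to above. Summarising spread as the expected cascade size $\sum_i \nu_i$, counting a node as non-vulnerable when $\nu_i^{\text{method}} \geq \nu_i^{\text{benchmark}}$, and the function names are all assumptions made here.

```python
def fitness(nu_method, nu_benchmark):
    """Return (spread, fairness) for one candidate seed set.

    spread   : expected cascade size, i.e. the sum of per-node frequencies
    fairness : fraction of nodes that are not vulnerable, taken here as
               nu_method >= nu_benchmark (ratio of at least 1)
    """
    nodes = list(nu_benchmark)
    spread = sum(nu_method.get(i, 0.0) for i in nodes)
    non_vulnerable = sum(nu_method.get(i, 0.0) >= nu_benchmark[i] for i in nodes)
    return spread, non_vulnerable / len(nodes)


def non_dominated(candidates):
    """Keep the (spread, fairness) pairs that no other pair dominates.

    A pair dominates another if it is at least as good in both objectives
    and strictly better in at least one; both objectives are maximised.
    """
    front = []
    for i, (s_i, f_i) in enumerate(candidates):
        dominated = any(
            s_j >= s_i and f_j >= f_i and (s_j > s_i or f_j > f_i)
            for j, (s_j, f_j) in enumerate(candidates)
            if j != i
        )
        if not dominated:
            front.append((s_i, f_i))
    return front


# Usage with made-up scores for five candidate seed sets.
scores = [(10, 0.5), (9, 0.6), (8, 0.55), (10, 0.4), (7, 0.7)]
print(non_dominated(scores))  # [(10, 0.5), (9, 0.6), (7, 0.7)]
```

Each surviving pair represents a different trade-off between reach and fairness; choosing among them is a policy decision rather than a computational one.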

Fig. 3a shows the theoretical Pareto front (in multi-objective optimization a Pareto front denotes a line of optimal solutions (?)) for the social network between households of a south-Indian village (?). Our method identifies seed sets which are fairer and, at the same time, as effective at maximizing influence as the influence maximization heuristics (HD, CHD, DD). (We disregard here seed sets identified by KC as they are far inferior to the ones produced by the other three heuristics.) For this network, our algorithm identifies nine possible influencer sets, undiscovered by the traditional heuristics, with different trade-offs between maximising information reach (cascade size) and the number of non-vulnerable nodes (fairness). As these findings are based on a theoretical approximation we also evaluate these seed sets using ICMs. Fig. 3b shows that our theoretical predictions are consistent with results found by numerically simulating information spread. As such, for a negligible reduction in cascade size we can, for this specific network, choose fairer seeds that roughly correspond to $6\%$ to $10\%$ fewer vulnerable nodes. Fig. 3c-d illustrates the difference between the seed nodes inferred using a state-of-the-art influencer heuristic (CHD) and our approach. The figure shows initial seed nodes (in black) and the activation of edges, where an activated edge indicates successful information propagation. There is a clear difference between the structural locations of the seed nodes in the network. Using the average distance from seed nodes to the rest of the network we calculate how far each seed set is from all nodes in the network. We find that our method identifies nodes which are more evenly distributed in the network and, on average, closer to the overall network (see Sec. S10 in the SM). In turn, this results in larger parts of the network being more easily reached.
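One way to quantify how close a seed set is to the rest of the network is a multi-source breadth-first search that records, for every node, the hop distance to its nearest seed. The sketch below averages this distance over all reachable non-seed nodes. Whether this matches the exact definition used in SM Sec. S10 is not stated here, so the nearest-seed convention and the function name should be read as assumptions.

```python
from collections import deque


def mean_distance_to_seeds(adj, seeds):
    """Average hop distance from non-seed nodes to their nearest seed.

    Runs one breadth-first search started from all seeds at once.
    Nodes that cannot be reached from any seed are ignored.
    """
    dist = {s: 0 for s in seeds}
    queue = deque(dist)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = [d for d in dist.values() if d > 0]
    return sum(others) / len(others) if others else 0.0


# Usage: on a path 0-1-2-3-4 a central seed is closer to everyone.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(mean_distance_to_seeds(path, [0]))  # 2.5
print(mean_distance_to_seeds(path, [2]))  # 1.5
```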

Fig. 3a,b show an example for one specific network; however, our fair influence maximization method works equally well for other networks (Fig. 3e-h). For these networks the traditional heuristics identify influencer sets which optimize cascade size, but we find that these seed sets are not fair in terms of information equality (for numerical results see Supplementary Materials). It is possible to improve this, with large gains in fairness achievable in exchange for a small reduction in cascade size. Overall we find that a $1\%$ trade-off in reach can result in a decrease in the number of vulnerable individuals of $1.9\%$ for the network of political blogs, $3.1\%$ for student communication, $10.5\%$ for online friendships, and up to $24\%$ in collaboration networks (here we have disregarded the face-to-face network due to the low number of identified seed sets, which makes it difficult to calculate the proper trade-off value).

Lastly, for the collaboration network (Fig. 3h), we find that the seed sets identified by traditional heuristics are not close to the Pareto frontier, meaning they are sub-optimal in terms of both cascade size and fair information access. In this situation our algorithm can also be used to identify previously undiscovered seed sets which optimize cascade size.

Discussion

The United Nations Sustainable Development Goals (SDGs) (?) recognize that eradicating inequalities in all their forms and dimensions is one of the greatest global challenges our societies face. Algorithms have the power to deliver on the SDGs. For example, access to information is critical for vaccination campaigns, and algorithms such as influence maximization have a role to play in making these campaigns more effective. However, algorithms can also bring potential biases into play. Our results show there is a group of nodes that are consistently left behind by influence maximization algorithms. In particular, for both real-world and synthetic networks, we find that access to information is unequal, both in terms of how often information is received and how recent the information is. This behavior is not limited to networks with low or high clustering, nor to specific types of interactions (see Supplementary Materials Table S1); we find it present across all networks we investigated.

Although algorithmic systems can be biased due to many factors (?), it is often thought that biases usually appear due to skews or misrepresentations in training data. However, that is not the case for influence maximization algorithms. Here, the issue lies with the problem statement, specifically the choice of the objective function. Developing algorithms that focus solely on optimizing reach, without considering equity or fairness, creates a bias. Unfortunately, not receiving information has real-world consequences. For example, experiences from mass drug administration campaigns have pinpointed that individuals are left untreated not due to lack of medicine, but because they never receive information about the campaign (?). Further, due to the pervasive usage of influence maximization algorithms in social recommendations (?), network monitoring (?), and even rumor control (?), the cumulative effects of information inequality in the populations left behind, in combination with other types of digital gaps, can create large fractures in the social fabric of our societies.

Thus, it is vital to understand whether such algorithms are equitable, to quantify the level of inequality, and to propose alternatives that balance potential reach and equity. Multi-objective optimization is a well-known computational tool which adds nuance to optimization problems and makes it possible to include multiple criteria when selecting influencer sets. As such, to close the information equity gap we propose a multi-objective formulation of the fair influence maximization problem and develop a genetic algorithm to solve it. Our results demonstrate it is possible to find influencer sets that reduce vulnerability at a relatively low trade-off with respect to spread. For example, we find that a mere $1\%$ reduction in reach (i.e. cascade size) can reduce the number of people typically left behind in information campaigns by up to $24\%$.

Our multi-objective algorithm is a first approach to solving this critical problem, yet it is not perfect. We believe it can act as a starting point for more systematic solutions for fair information access, as this issue arises across many other contexts within network science, artificial intelligence, and computational science (?, ?). One particular application is online social networks, where incorporating additional algorithmic objectives can help detect vulnerable individuals, mitigate and reduce segregation, lessen polarization between groups, and guide the design of more equal information dissemination structures.

Our approach requires information about the full network. Noise and incomplete mappings of networks will naturally affect this (?). However, we believe the effect will not differ from what traditional heuristics like degree discount, CoreHD, or highest degree already experience, as they also require information about the total graph. Another shortcoming is that we focus purely on simple contagion effects, where nodes have equal and independent probabilities of adopting a behavior. Complex contagion (where individuals require social affirmation from multiple sources) has been observed for certain social settings, including sharing of hashtags on social media (?), social bots (?), and certain online behaviors (?). The nature of contagion depends on the type of situation, and on whether interactions happen at a local or global level. We focus on simple contagion because it is believed to be the main factor in information spreading (?). For example, if a person is looking for a new job, it is more beneficial to receive information from the global network, via weak ties, rather than just from close friends and family (?). However, future work should examine how information inequalities develop in complex contagion scenarios. Further, the current study focuses on static and undirected networks, but information propagation in the real world often occurs through directed and temporal connections. Extending the framework to such networks and investigating whether similar inequalities emerge is an equally important direction for future work.

Lastly, our definition of information inequality relies on benchmarking existing methods against random information spreading scenarios, as this is the most 'fair' system we can imagine. Other definitions can also be used, and future work should focus on testing them. Independent of the choice of definition, it is vital that inequalities which arise, or are amplified, as a result of algorithms be quantified and measured. As our world becomes increasingly digitalized, access to correct, timely, and factual information will grow in significance. As such, it is critical to know how well algorithms which deal with information dissemination and delivery work, and which groups and individuals they leave behind.

References

1. D. Centola, Science 329, 1194 (2010).
2. N. A. Christakis, J. H. Fowler, New England Journal of Medicine 357, 370 (2007).
3. J. S. Coleman, E. Katz, H. Menzel, Medical innovation: A diffusion study (Bobbs-Merrill Co, 1966).
4. E. M. Rogers, Diffusion of innovations (Simon and Schuster, 2010).
5. A. Banerjee, A. G. Chandrasekhar, E. Duflo, M. O. Jackson, Science 341, 1236498 (2013).
6. J. Cai, A. De Janvry, E. Sadoulet, American Economic Journal: Applied Economics 7, 81 (2015).
7. J. H. Fowler, N. A. Christakis, BMJ 337, a2338 (2008).
8. J. Berger, G. Le Mens, Proceedings of the National Academy of Sciences 106, 8146 (2009).
9. S. González-Bailón, J. Borge-Holthoefer, A. Rivero, Y. Moreno, Scientific reports 1, 197 (2011).
10. R. M. Bond, et al., Nature 489, 295 (2012).
11. L. Beaman, A. BenYishay, J. Magruder, A. M. Mobarak, Working Paper (2015).
12. C. B. Barrett, M. A. Constas, Proceedings of the National Academy of Sciences 111, 14625 (2014).
13. G. F. Chami, et al., Clinical Infectious Diseases 62, 200 (2016).
14. M. Alexander, L. Forastiere, S. Gupta, N. A. Christakis, Proceedings of the National Academy of Sciences 119, e2120742119 (2022).
15. S. P. Borgatti, Computational & Mathematical Organization Theory 12, 21 (2006).
16. F. Radicchi, C. Castellano, Physical Review E 95, 012318 (2017).
17. D. Kempe, J. Kleinberg, É. Tardos, Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, 2003), pp. 137–146.
18. W. Chen, Y. Wang, S. Yang, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, 2009), pp. 199–208.
19. M. Kitsak, et al., Nature Physics 6, 888 (2010).
20. S. Pei, L. Muchnik, J. S. Andrade Jr, Z. Zheng, H. A. Makse, Scientific reports 4, 5547 (2014).
21. A. Y. Lokhov, D. Saad, Proceedings of the National Academy of Sciences p. 201614694 (2017).
22. S. Aral, P. S. Dhillon, Nature Human Behaviour p. 1 (2018).
23. Y. Chen, G. Paul, S. Havlin, F. Liljeros, H. E. Stanley, Physical review letters 101, 058701 (2008).
24. F. Morone, H. A. Makse, Nature 524, 65 (2015).
25. P. Clusella, P. Grassberger, F. J. Pérez-Reche, A. Politi, Physical review letters 117, 208301 (2016).
26. L. Zdeborová, P. Zhang, H.-J. Zhou, Scientific reports 6 (2016).
27. S. Mugisha, H.-J. Zhou, Physical Review E 94, 012305 (2016).
28. A. Braunstein, L. Dall’Asta, G. Semerjian, L. Zdeborová, Proceedings of the National Academy of Sciences p. 201605083 (2016).
29. N. Eagle, M. Macy, R. Claxton, Science 328, 1029 (2010).
30. R. Chetty, et al., Nature 608, 108 (2022).
31. R. Chetty, et al., Nature 608, 122 (2022).
32. B. Fish, et al., The World Wide Web Conference (2019), pp. 480–490.
33. A.-A. Stoica, A. Chaintreau, Companion Proceedings of The 2019 World Wide Web Conference (2019), pp. 569–574.
34. A.-A. Stoica, J. X. Han, A. Chaintreau, Proceedings of The Web Conference 2020 (2020), pp. 2089–2098.
35. Z. S. Jalali, W. Wang, M. Kim, H. Raghavan, S. Soundarajan, Proceedings of the 2020 SIAM International Conference on Data Mining (SIAM, 2020), pp. 613–521.
36. S. Luo, F. Morone, C. Sarraute, M. Travizano, H. A. Makse, Nature communications 8, 15227 (2017).
37. M. McPherson, L. Smith-Lovin, J. M. Cook, Annual review of sociology 27, 415 (2001).
38. Y. Leo, E. Fleury, J. I. Alvarez-Hamelin, C. Sarraute, M. Karsai, Journal of The Royal Society Interface 13, 20160598 (2016).
39. F. Karimi, M. Génois, C. Wagner, P. Singer, M. Strohmaier, Scientific reports 8 (2018).
40. W. L. Shirley, B. J. Boruff, S. L. Cutter, Hazards Vulnerability and Environmental Justice (Routledge, 2012), pp. 143–160.
41. B. L. Turner, et al., Proceedings of the National Academy of Sciences 100, 8074 (2003).
42. A. Tsang, B. Wilder, E. Rice, M. Tambe, Y. Zick, arXiv preprint arXiv:1903.00967 (2019).
43. X. Wang, O. Varol, T. Eliassi-Rad, Applied Network Science 7, 1 (2022).
44. H. Heidari, C. Ferrari, K. Gummadi, A. Krause, Advances in neural information processing systems 31 (2018).
45. J. Goldenberg, B. Libai, E. Muller, Marketing letters 12, 211 (2001).
46. R. Albert, H. Jeong, A.-L. Barabási, Nature 406, 378 (2000).
47. A.-L. Barabási, R. Albert, Science 286, 509 (1999).
48. A. D. Broido, A. Clauset, arXiv preprint arXiv:1801.03400 (2018).
49. SocioPatterns, Infectious contact networks. http://www.sociopatterns.org/datasets/. Accessed 09/12/12.
50. L. A. Adamic, N. Glance, Proceedings of the 3rd international workshop on Link discovery (ACM, 2005), pp. 36–43.
51. T. Opsahl, P. Panzarasa, Social networks 31, 155 (2009).
52. E. Cho, S. A. Myers, J. Leskovec, Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, 2011), pp. 1082–1090.
53. J. Leskovec, J. Kleinberg, C. Faloutsos, ACM Transactions on Knowledge Discovery from Data (TKDD) 1, 2 (2007).
54. M. Akbarpour, S. Malladi, A. Saberi, Available at SSRN (2017).
55. G. F. Chami, S. E. Ahnert, N. B. Kabatereine, E. M. Tukahebwa, Proceedings of the National Academy of Sciences p. 201700166 (2017).
56. E. M. Airoldi, N. A. Christakis, Science 384, eadi5147 (2024).
57. M. Garcia-Herranz, E. Moro, M. Cebrian, N. A. Christakis, J. H. Fowler, PloS one 9, e92413 (2014).
58. Z. Obermeyer, B. Powers, C. Vogeli, S. Mullainathan, Science 366, 447 (2019).
59. M. H. Ribeiro, R. Ottoni, R. West, V. A. Almeida, W. Meira Jr, Proceedings of the 2020 conference on fairness, accountability, and transparency (2020), pp. 131–141.
60. R. L. Thomas, D. Uminsky, Patterns 3, 100476 (2022).
61. A. Ishizaka, P. Nemery, Multi-criteria decision analysis: methods and software (John Wiley & Sons, 2013).
62. United Nations General Assembly, Transforming our world: the 2030 Agenda for Sustainable Development, https://sustainabledevelopment.un.org/post2015/transformingourworld (2015).
63. N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, ACM computing surveys (CSUR) 54, 1 (2021).
64. F. Coró, G. D’angelo, Y. Velaj, ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 1 (2021).
65. J. Leskovec, et al., Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining (2007), pp. 420–429.
66. C. Budak, D. Agrawal, A. El Abbadi, Proceedings of the 20th international conference on World wide web (2011), pp. 665–674.
67. C. Wagner, et al., Nature 595, 197 (2021).
68. I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré, K. Murphy, Journal of Machine Learning Research 23, 1 (2022).
69. L. Peel, T. P. Peixoto, M. De Domenico, Nature Communications 13, 6794 (2022).
70. D. M. Romero, B. Meeder, J. Kleinberg, Proceedings of the 20th international conference on World wide web (2011), pp. 695–704.
71. B. Mønsted, P. Sapieżyński, E. Ferrara, S. Lehmann, PloS one 12, e0184148 (2017).
72. D. Centola, M. Macy, American journal of Sociology 113, 702 (2007).
73. M. S. Granovetter, American journal of sociology 78, 1360 (1973).
74. C. Castellano, R. Pastor-Satorras, The European Physical Journal B 89, 243 (2016).
75. F.-A. Fortin, F.-M. De Rainville, M.-A. Gardner, M. Parizeau, C. Gagné, Journal of Machine Learning Research 13, 2171 (2012).

Acknowledgments

MGH and ID thank AECID (Spanish Agency for International Development Cooperation) for its support of data innovation and Frontier Data Technologies through UNICEF’s Frontier Data Network. MC acknowledges funding from the Ministerio de Ciencia, Innovación y Universidades and from the “Convocatoria de la Universidad Carlos III de Madrid de Ayudas para la recualificación del sistema universitario español para 2021-2023, de 1 de julio de 2021”, based on Real Decreto 289/2021 of 20 April 2021, which regulates the direct award of grants to public universities for the requalification of the Spanish university system. Furthermore, MC is thankful for the support of project “Ayuda PID2022-137243OB-I00 financiada por MCIN/AEI/10.13039/501100011033” and of “FEDER Una manera de hacer Europa”. EM acknowledges support from the National Science Foundation under grant No. 2218748.

Methods

Independent cascade model

The ICM process is as follows: at time $t=0$ all nodes are inactive, except for a set of initial seed nodes. At each time step $t$, an activated node $i$ contacts all of its neighbors that have not previously been activated and tries to activate each of them independently with transmission probability $p$. After attempting to activate all its neighbors, the node becomes inactive and cannot be activated again in subsequent stages of the dynamics. The process is iterated until no active nodes remain.
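The process described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the experiments; the function name `independent_cascade` and the adjacency-dictionary representation `adj` are assumptions made for the example.

```python
import random

def independent_cascade(adj, seeds, p, rng=None):
    """Run one ICM realization; return the set of nodes that were ever activated.

    adj   : dict mapping each node to an iterable of its neighbors (assumed format)
    seeds : initial seed nodes, active at t = 0
    p     : independent transmission probability
    """
    rng = rng or random.Random()
    activated = set(seeds)       # nodes that have been activated at some point
    frontier = list(activated)   # nodes that are active in the current time step
    while frontier:
        next_frontier = []
        for i in frontier:
            for j in adj[i]:
                # Node i gets a single attempt on each not-yet-activated neighbor.
                if j not in activated and rng.random() < p:
                    activated.add(j)
                    next_frontier.append(j)
        # Nodes in the old frontier are now inactive and are never reactivated.
        frontier = next_frontier
    return activated

# Usage on a toy path graph 0 - 1 - 2 - 3 (hypothetical data).
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
reached = independent_cascade(path, seeds=[0], p=0.5, rng=random.Random(1))
print(len(reached), "of", len(path), "nodes reached")
```

Averaging the size of the returned set over many realizations gives an estimate of the expected reach of a seed set.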

Infection probability

For ICM the only parameter is the activation probability $p$ (the probability of convincing a person to adopt a behavior). If $p$ is too high, a global information cascade occurs and the full network adopts the behavior; if $p$ is too low, information does not spread. We set $p=p_{c}$, where $p_{c}$ is the critical probability separating the region of the phase diagram where cascades (outbreaks) are subextensive ($p<p_{c}$) from the supercritical region ($p>p_{c}$) where outbreaks reach a finite fraction of the whole network (?). For each network we calculate the critical value of the transmission probability ($p_{c}$) as the position of the maximum of the susceptibility $\langle s^{2}\rangle/\langle s\rangle^{2}$, where $\langle s^{n}\rangle$ is the $n$-th moment of the outbreak size distribution computed for randomly selected initial single spreaders (?). See the Supplementary Information for more information.
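As a hedged illustration of this estimate, the sketch below scans a grid of candidate probabilities and returns the one that maximizes the sampled susceptibility $\langle s^{2}\rangle/\langle s\rangle^{2}$. The function names, the grid `p_grid`, and the number of runs are assumptions for the example, not values taken from the paper.

```python
import random

def outbreak_size(adj, seed, p, rng):
    """Size of one ICM outbreak started from a single seed node."""
    activated, frontier = {seed}, [seed]
    while frontier:
        nxt = []
        for i in frontier:
            for j in adj[i]:
                if j not in activated and rng.random() < p:
                    activated.add(j)
                    nxt.append(j)
        frontier = nxt
    return len(activated)

def susceptibility(adj, p, runs, rng):
    """Estimate <s^2> / <s>^2 from outbreaks seeded at randomly chosen single nodes."""
    nodes = list(adj)
    sizes = [outbreak_size(adj, rng.choice(nodes), p, rng) for _ in range(runs)]
    m1 = sum(sizes) / runs                  # first moment <s>
    m2 = sum(s * s for s in sizes) / runs   # second moment <s^2>
    return m2 / (m1 * m1)

def estimate_pc(adj, p_grid, runs=200, seed=0):
    """Return the grid value of p at which the sampled susceptibility peaks."""
    rng = random.Random(seed)
    return max(p_grid, key=lambda p: susceptibility(adj, p, runs, rng))

# Usage on a toy ring of 30 nodes (hypothetical data and grid).
ring = {i: [(i - 1) % 30, (i + 1) % 30] for i in range(30)}
print(estimate_pc(ring, [0.1 * k for k in range(1, 10)]))
```

The estimate is noisy for a small number of runs, so in practice the number of realizations per grid point and the grid resolution would both need to be chosen with the network size in mind.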

Fair Influence Maximization

We have implemented a simple version of a non-dominated sorting genetic algorithm (NSGA-III) using the Distributed Evolutionary Algorithms in Python (DEAP) library (?). The code is freely available at [link]. Briefly, the main modelling set-up is the following:

• Individual sets are sets of nodes.
• Initialization is performed by generating one individual set with each of the heuristics mentioned in this paper and the rest completely at random.
• Crossover of two individual sets is performed by generating the union of both sets and choosing from that set the seeds for both new individuals at random.
• Mutation is composed of two operators that are performed with different frequencies:
  1. Random, where $10\%$ of the seeds are removed from an individual set and new ones are selected at random.
  2. Tabu-like, where one seed is removed at random and a certain number (or all) of random seeds are inspected for addition to the set, and the one with the lowest vulnerability is finally selected.
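The crossover and mutation operators listed above can be sketched as plain Python functions. This is a minimal sketch under stated assumptions, not the released implementation: the function names are hypothetical, node labels are assumed to be sortable, seed sets are assumed to be non-empty and smaller than the network, and `score` stands in for whatever vulnerability measure is minimized (here it is assumed to take a candidate seed set and return a number).

```python
import random

def crossover(a, b, rng):
    """Draw both children at random from the union of the two parent seed sets."""
    pool = sorted(set(a) | set(b))
    return set(rng.sample(pool, len(a))), set(rng.sample(pool, len(b)))

def mutate_random(seeds, nodes, rng, frac=0.1):
    """Replace a fraction of the seeds (10% by default) with randomly chosen non-seeds."""
    seeds = set(seeds)
    candidates = [n for n in nodes if n not in seeds]
    k = min(max(1, round(frac * len(seeds))), len(candidates))
    removed = set(rng.sample(sorted(seeds), k))
    return (seeds - removed) | set(rng.sample(candidates, k))

def mutate_tabu(seeds, nodes, score, rng, n_inspect):
    """Drop one random seed, inspect n_inspect random non-seeds, keep the best one.

    `score` is a hypothetical callable: seed set -> vulnerability (lower is better).
    """
    seeds = set(seeds)
    kept = seeds - {rng.choice(sorted(seeds))}
    candidates = [n for n in nodes if n not in seeds]
    inspected = rng.sample(candidates, min(n_inspect, len(candidates)))
    best = min(inspected, key=lambda n: score(kept | {n}))
    return kept | {best}

# Usage with toy data: 20 nodes and a placeholder score (sum of node labels).
rng = random.Random(0)
nodes = list(range(20))
child1, child2 = crossover({0, 1, 2}, {2, 3, 4}, rng)
mutated = mutate_tabu(child1, nodes, score=sum, rng=rng, n_inspect=4)
print(child1, child2, mutated)
```

All three operators preserve the size of the seed set, which keeps every individual within the same seeding budget. Functions of this shape could be registered as the mating and mutation operators of an evolutionary toolbox.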

All experiments on empirical networks were run with a population of $100$ individual sets, a maximum of $100$ generations, a crossover probability of $0.8$, a mutation probability of $1$, a Tabu-like mutation frequency of $0.4$, and a tabu neighborhood size of $20\%$ of the total number of nodes in the network.