A memory-based spatial evolutionary game with the dynamic interaction between learners and profiteers

Bin Pi School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China.    Minyu Feng College of Artificial Intelligence, Southwest University, Chongqing 400715, China.    Liang-Jian Deng liangjian.deng@uestc.edu.cn Corresponding author School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China.
(June 2, 2024)

Spatial evolutionary games provide a valuable framework for elucidating the emergence and maintenance of cooperative behavior. However, most previous studies assume that individuals are profiteers and neglect to consider the effects of memory. To bridge this gap, in this paper, we propose a memory-based spatial evolutionary game with dynamic interaction between learners and profiteers. Specifically, there are two different categories of individuals in the network, including profiteers and learners with different strategy updating rules. Notably, there is a dynamic interaction between profiteers and learners, i.e., each individual has the transition probability between profiteers and learners, which is portrayed by a Markov process. Besides, the payoff of each individual is not only determined by a single round of the game but also depends on the memory mechanism of the individual. Extensive numerical simulations validate the theoretical analysis and uncover that dynamic interactions between profiteers and learners foster cooperation, memory mechanisms facilitate the emergence of cooperative behaviors among profiteers, and increasing the learning rate of learners promotes a rise in the number of cooperators. In addition, the robustness of the model is verified through simulations across various network sizes. Overall, this work contributes to a deeper understanding of the mechanisms driving the formation and evolution of cooperation.

With the burgeoning development of artificial intelligence, reinforcement learning methods have become increasingly prevalent in exploring structured population behavior in evolutionary games. In light of the ubiquitous profit-seeking behavior observed in society and the inherent memory mechanisms of individuals, we propose a novel model in this paper, i.e., the memory-based spatial evolutionary game model with the dynamic interaction between learners and profiteers, where the memory mechanism is described by the memory length and the memory decay factor, and the dynamic interactions between learners and profiteers are modeled by a two-state homogeneous Markov chain. In addition, we conduct extensive simulations to verify the theoretical derivations, to examine how the memory mechanism and the dynamic interaction between the two categories of individuals affect the frequency of cooperators, and to study the emergence and evolution of cooperative behavior from a microscopic perspective.

I Introduction

Cooperative behavior has been observed across various scales, ranging from microorganisms to complex animal societies, underscoring its ubiquity in the natural world. Scholars across disciplines, including sociologists simpson2015beyond , psychologists henrich2021origins , economists niyazbekova2023sustainable , physicists guo2023third , and mathematicians sun2023state , etc. feng2023evolutionary ; li2023open have shown interest in understanding the origins and sustainability of cooperation. The network evolutionary game, as a combination of complex networks and evolutionary game theory, provides a practical framework for studying the emergence of cooperative behaviors in structured groups, where each node in a complex network represents an individual and the edges indicate the interactions between the individuals. Typical game models include the prisoner’s dilemma game wang2022levy , snowdrift game pi2022evolutionary2 , stag hunt game wang2013evolving with two players, and the public goods game wang2022replicator with multiple players. Besides, evolutionary games based on various structured populations have been widely proposed and studied, spanning square lattice networks with periodic boundaries flores2022cooperation ; szabo2016evolutionary , small-world networks lin2020evolutionary ; chen2008promotion , scale-free networks shen2024extortion ; kleineberg2017metric , temporal networks li2020evolution ; sheng2023evolutionary , and higher-order networks alvarez2021evolutionary ; kumar2021evolution . In addition to these, evolutionary games have recently gained attention and success in other areas as well capraro2024outcome .

In recent years, researchers have extensively explored the underlying drivers behind the spontaneous emergence and sustenance of cooperative behaviors within competitive environments, corroborating their findings through numerous simulation experiments. A seminal contribution is the five rules proposed by Nowak nowak2006five . These rules encompass kin selection, direct reciprocity, indirect reciprocity, network reciprocity, and group selection, which elucidate diverse pathways to cooperation. Moreover, the mechanisms favoring cooperation include reputation mechanism xia2023reputation , trust xie2024trust , reward and punishment wang2014rewarding , etc. wang2018exploiting ; arefin2021imitation ; zhang2023evolutionary , which have also received extensive attention and investigation by scholars. In addition to these, many recent studies employ reinforcement learning methods to study the behavior of individuals in network evolutionary games, and Q-learning has especially become a dominant approach in these studies. For example, Yang et al. integrated Q-learning agents into the evolutionary prisoner’s dilemma game on the square lattice with periodic boundary conditions and found that interaction state Q-learning promotes the emergence and evolution of cooperation yang2024interaction . Shi and Rong delved into the dynamics of Q-learning and frequency adjusted Q-learning algorithms in multi-agent systems and revealed the intrinsic mechanisms of these algorithms from the perspective of evolutionary dynamics shi2022analysis . Ding et al. explored the impact of Q-learning on cooperation by involving extortion and observed that Q-learning significantly boosts the cooperation level of the network ding2019q . Therefore, amidst the rapid advancement of artificial intelligence, it is important to consider the influence of intelligent individuals equipped with learning on evolutionary dynamics.

In real systems, the memory of intelligent individuals significantly affects decision-making processes: their actions depend not only on the current situation but also on past experiences. Some researchers discovered this phenomenon and achieved fruitful results in network evolutionary games. For example, classical strategies such as generous-tit-for-tat (GTFT) nowak1992tit and win-stay, lose-shift (WSLS) nowak1993strategy proposed by Nowak are demonstrated to yield promising results in repeated prisoner’s dilemma games. In addition, Pi et al. considered the memory mechanism and proposed two strategy-updating rules based on profiteers and conformists and found that the memory mechanism promotes cooperation over a large parameter area pi2022evolutionary . Ma et al. examined the effect of working memory capacity, a crucial neural function, on cooperation in repeated prisoner’s dilemma experiments and discovered that the level of cooperation was optimal when subjects remembered the first two rounds of information and that there was a sudden increase in the level of cooperation as memory capacity increased from none to minimal ma2021limited . Lu et al. proposed a prisoner’s dilemma game model with a memory effect on spatial lattices and observed that the memory effect could effectively change the cooperative behavior in the spatial prisoner’s dilemma game lu2018role . Therefore, the memory mechanism of the individual plays an indispensable role in the emergence of cooperative behaviors.

Profiteers commonly adhere to the Fermi rule, a strategy updating rule where individuals are more inclined to adopt the strategy of another individual with a higher payoff perc2010coevolutionary ; yao2023inhibition ; jusup2022social . On the other hand, learners frequently employ Q-learning, a prevalent strategy updating rule where decisions are informed by previous learning experiences wang2024enhancing ; zhu2023co ; mcglohon2005learning . However, most of the previous studies simplified the scenario by considering either profiteers or learners independently, while in practice, these two categories interact continuously, i.e., the neighborhood of an individual consists of both profiteers and learners, and an individual does not remain in one category all the time. Therefore, to bridge this gap, we consider the dynamic interaction between profiteers and learners in this paper, where each individual changes from a learner (profiteer) to a profiteer (learner) with a certain probability over time, which can be described as a two-state homogeneous discrete Markov chain. The different categories of individuals are mainly reflected in the different strategy updating rules they adopt. Concretely, the profiteer uses the classical Fermi rule, preferring to imitate the strategy of individuals with higher payoffs, whereas the learner employs Q-learning in reinforcement learning and decides which strategy to utilize by continuously learning from the past. Furthermore, as we mentioned before, memory plays a crucial role in individuals’ decision-making processes. Therefore, we introduce the memory mechanism into the evolution of the game. Specifically, the payoff of an individual is not solely dependent on a single game round but is a cumulative payoff, which is related to the memory length and the memory decay factor of the individual. This reflects the fact that an individual’s memory is not infinite and that an event further in the past has a smaller effect on the individual. 
Through our investigation of the memory-based snowdrift game with the dynamic interaction between profiteers and learners on regular square lattices with periodic boundary conditions and Watts-Strogatz small-world networks watts1998collective , we find that the memory mechanism of individuals promotes the emergence and maintenance of cooperation among profiteers and the dynamic interaction between learners and profiteers enhances the cooperative behavior of the structured populations.

In the remainder of this paper, we first present the memory-based game model with the dynamic interaction between learners and profiteers in detail in Sec. II. Following that, in Sec. III, we show the simulation results and conduct thorough analyses. In the last section, we summarize the work and outline directions for future research.

II Model

In this section, we introduce the memory-based spatial evolutionary game with the dynamic interaction between learners and profiteers, which is described in terms of four aspects: (i) the game model, (ii) the memory mechanism, (iii) the dynamic interactions between learners and profiteers, and (iv) the stationary distribution of the number of learners and profiteers.

II.1 Game model

In this study, we adopt the classical snowdrift game (SDG) for its generality, where the reward (R) for the interaction of two cooperative strategies is fixed to 1, the punishment (P) for the interaction of two defective strategies is set to 0, and for the interaction of cooperative and defective strategies, the cooperator receives a sucker’s payoff (S) of $1-r$ while the defector obtains a temptation to defect (T) of $1+r$. Hence, the payoff matrix of SDG is represented as follows:

$$A=\begin{pmatrix}1 & 1-r\\ 1+r & 0\end{pmatrix}, \tag{1}$$

where $r$ denotes the cost-to-benefit ratio of mutual cooperation, a tunable parameter taking values between 0 and 1.
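As a minimal sketch, the payoff matrix of Eq. 1 can be written down in Python as follows. The function names `sdg_payoff_matrix` and `pair_payoff` and the index encoding (0 for cooperation, 1 for defection) are illustrative assumptions rather than part of the model description.

```python
def sdg_payoff_matrix(r):
    """Return the snowdrift payoff matrix A of Eq. 1 as nested lists.

    Rows index the focal player's strategy and columns the opponent's,
    with 0 = cooperate and 1 = defect (an encoding assumed for this sketch).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r is expected to lie in [0, 1]")
    return [[1.0, 1.0 - r],
            [1.0 + r, 0.0]]


def pair_payoff(own, other, r):
    """Payoff of a player using strategy `own` against strategy `other`."""
    return sdg_payoff_matrix(r)[own][other]
```

For example, `pair_payoff(0, 1, r)` returns the sucker's payoff $1-r$ and `pair_payoff(1, 0, r)` returns the temptation $1+r$.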

II.2 Memory mechanism

Next, we provide a detailed explanation of the memory mechanism proposed in this paper. Each individual in the system possesses a memory, i.e., it knows its payoffs from the previous $M$ rounds of interactions with its neighbors. Here, $M$ denotes the length of the individual’s memory, and the payoffs in the past $M$ rounds have an impact on the individual’s current payoff. However, the impact of past interactions diminishes over time. To account for this, we introduce a memory decay factor $\beta$, which characterizes the decreasing influence of past events on an individual’s current payoff. Therefore, the payoff of individual $i$ at time $t$ based on the memory mechanism can be expressed as

$$U_i^t=\sum_{k=t-M+1}^{t}\beta^{t-k}\Pi_i^k, \tag{2}$$

where $\Pi_i^k$ represents the actual payoff of individual $i$ obtained from playing the snowdrift game with all neighbors in round $k$, and it can be calculated as

$$\Pi_i^k=\sum_{j\in\Omega_i}s_i^T A s_j, \tag{3}$$

where $\Omega_i$ denotes the set of all neighbors of individual $i$. We emphasize that when the evolutionary time is smaller than the memory length, i.e., $t<M$, the individual’s memory length is taken as $M=t$. Besides, $s_x$ is the strategy of individual $x$, where the unit vector $s_x=[0,1]^T$ indicates a defective strategy and $s_x=[1,0]^T$ denotes cooperation.
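The two payoff formulas can be sketched in a few lines of Python. This is only an illustration under stated assumptions: strategies are encoded as 0 (cooperate) and 1 (defect) instead of unit vectors, round payoffs are kept in a chronological list, and the names `round_payoff` and `memory_payoff` are hypothetical.

```python
def round_payoff(own, neighbor_strategies, r):
    """Eq. 3: payoff of one round summed over all neighbors.

    Strategies are 0 (cooperate) or 1 (defect); indexing the payoff matrix
    with them is equivalent to the product s_i^T A s_j with unit vectors.
    """
    payoff_matrix = [[1.0, 1.0 - r], [1.0 + r, 0.0]]
    return sum(payoff_matrix[own][s] for s in neighbor_strategies)


def memory_payoff(history, memory_length, beta):
    """Eq. 2: decayed sum over the most recent `memory_length` round payoffs.

    `history` holds round payoffs in chronological order, the last entry
    being the current round. When fewer rounds than `memory_length` exist
    (the case t < M), all available rounds are used.
    """
    if memory_length < 1:
        raise ValueError("memory_length must be at least 1")
    recent = history[-memory_length:]
    newest = len(recent) - 1
    # The current round has weight beta**0; a round k steps back has beta**k.
    return sum(beta ** (newest - idx) * payoff
               for idx, payoff in enumerate(recent))
```

With `history = [1.0, 2.0, 3.0]`, `memory_length = 2` and `beta = 0.5`, only the last two rounds contribute, giving $3 + 0.5 \times 2 = 4$.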

II.3 Dynamic interactions between learners and profiteers

In reality, individuals exhibit diverse behaviors, with some driven primarily by profit-seeking motives (referred to as profiteers), while others rely on self-learning strategies (referred to as learners). Therefore, our proposed model incorporates the interaction between these two categories of individuals. Furthermore, individuals are not fixed in their roles throughout the evolutionary game process, i.e., individuals cannot always be profiteers or learners, and they undergo mutual transitions between the two states. For this reason, we introduce a Markov process to capture this dynamic. Specifically, a profiteer (resp. learner) changes to be a learner (resp. profiteer) with the probability $q$ (resp. $p$) at every moment, and the state transition matrix representing these probabilities is given by

$$B=\begin{pmatrix}1-p & p\\ q & 1-q\end{pmatrix}, \tag{4}$$

where $p$ denotes the probability of an individual transitioning from a learner to a profiteer and $1-p$ represents the probability of remaining a learner. Analogously, $q$ indicates the probability of an individual switching from a profiteer to a learner, while $1-q$ denotes the probability of staying a profiteer. We highlight that there is no relationship between the transition probabilities $p$ and $q$. The only requirement is that they must satisfy the definition of probability, meaning both $p$ and $q$ must fall between 0 and 1.
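One way to sample a single category transition according to Eq. 4 is sketched below. The string labels, the function name `next_role`, and the use of a seeded `random.Random` instance are assumptions made for illustration.

```python
import random

LEARNER, PROFITEER = "learner", "profiteer"


def next_role(role, p, q, rng):
    """Sample an individual's category at the next time step (Eq. 4).

    A learner becomes a profiteer with probability p; a profiteer becomes
    a learner with probability q; otherwise the category is unchanged.
    """
    u = rng.random()  # uniform draw in [0, 1)
    if role == LEARNER:
        return PROFITEER if u < p else LEARNER
    return LEARNER if u < q else PROFITEER


if __name__ == "__main__":
    rng = random.Random(42)
    role = LEARNER
    for _ in range(5):
        role = next_role(role, p=0.3, q=0.2, rng=rng)
        print(role)
```

Passing the random generator explicitly keeps runs reproducible, which is convenient when averaging over independent simulations.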

The key difference between learners and profiteers lies in their strategy updating rules. Profiteers tend to imitate neighbors with higher payoffs, employing the Fermi rule for strategy updating, i.e., at each round of evolution, a profiteer $i$ randomly selects an individual $j$ from its neighbors and adopts that neighbor’s strategy with the following probability:

$$W_p(s_i\leftarrow s_j)=\frac{1}{1+e^{(U_i-U_j)/\kappa}}, \tag{5}$$

where $s_i$ and $U_i$ signify the strategy (expressed as a unit vector) and the memory-based payoff of individual $i$, respectively. The parameter $\kappa$ is the noise factor, which is utilized to describe the irrational choices of individuals in the game.
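The Fermi rule of Eq. 5 might be implemented along the following lines. The names `fermi_probability` and `profiteer_update`, the representation of neighbors as `(strategy, payoff)` pairs, and the overflow guard are assumptions of this sketch.

```python
import math
import random


def fermi_probability(u_i, u_j, kappa):
    """Eq. 5: probability that i adopts the strategy of neighbor j."""
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    x = (u_i - u_j) / kappa
    if x > 700.0:
        # math.exp would overflow here; the probability is effectively 0.
        return 0.0
    return 1.0 / (1.0 + math.exp(x))


def profiteer_update(own_strategy, own_payoff, neighbors, kappa, rng):
    """Pick one random neighbor and imitate it with the Fermi probability.

    `neighbors` is a list of (strategy, memory_based_payoff) pairs.
    """
    strategy_j, payoff_j = rng.choice(neighbors)
    if rng.random() < fermi_probability(own_payoff, payoff_j, kappa):
        return strategy_j
    return own_strategy
```

When the two payoffs are equal the imitation probability is 0.5, and it approaches 1 as the neighbor's payoff advantage grows relative to $\kappa$.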

On the other hand, learners use a reinforcement learning algorithm known as Q-learning for strategy updates. Specifically, the Q-learning algorithm can be regarded as a Markov decision process, where the decision of an individual is only relevant to the current situation and is not influenced by past events, which can be represented by a tuple $(S,A,W_l,r)$. Note that, within this tuple, $A$ and $r$ denote the action space and the reward function rather than the payoff matrix of Eq. 1 and its cost-to-benefit parameter. Here, $S=\{0C,1C,2C,\cdots,|\Omega_i|C\}$ (the number of cooperators among neighbors) and $A=\{Cooperate,Defect\}$ denote the state space and action space of individual $i$. $W_l:S\times A\rightarrow p$ is the state transition probability after adopting action $a\in A$ in state $s\in S$, and $r:S\times A\rightarrow U$ represents the reward of the individual for performing action $a\in A$ in state $s\in S$, which can be obtained by Eq. 2. Each learner owns a Q-table, which is updated after each round of the game with all neighbors according to the following equation:

$$Q_{S_t}^{A_t}(t+1)=Q_{S_t}^{A_t}(t)+\alpha\left[r_{S_t}^{A_t}(t+1)+\gamma\max_{a\in A}Q_{S_{t+1}}^{a}(t)-Q_{S_t}^{A_t}(t)\right], \tag{6}$$

where $\alpha\in[0,1]$ and $\gamma\in[0,1]$ denote the learning rate and discount factor, respectively. A smaller $\gamma$ causes the individual to focus more on the immediate payoff, whereas a larger $\gamma$ places more weight on expected future payoffs. $Q_{S_t}^{A_t}(t)$ represents the Q-value (estimated utility) of taking action $A_t$ in state $S_t$ at time $t$. $r_{S_t}^{A_t}(t+1)$ indicates the immediate payoff gained by the individual at time $t+1$ after performing action $A_t$ in state $S_t$ at time $t$, which is determined by Eq. 2. The term $\max_{a\in A}Q_{S_{t+1}}^{a}(t)$ signifies the maximum Q-value available to the individual in the future state $S_{t+1}$. 
Therefore, the learner updates its Q-table by repeatedly playing the game with its neighbors during the evolutionary process, and the Q-table in turn guides its strategy updates. In particular, to portray the random choices of individuals in real situations, which reflect a compromise between exploration and exploitation, we introduce the $\epsilon$-greedy algorithm. Specifically, at each decision point, exploration occurs with probability $\epsilon$, i.e., a strategy is selected uniformly at random from the action set $A$, and exploitation is performed with probability $1-\epsilon$, i.e., the strategy with the highest Q-value in the current state is taken; if several strategies share the highest Q-value, one of them is chosen at random.
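A compact sketch of the learner's update (Eq. 6) and of $\epsilon$-greedy action selection is given below. It assumes a Q-table stored as a list indexed by the number of cooperating neighbors, zero-initialized Q-values (the initialization is not specified in the text), and hypothetical function names.

```python
import random

COOPERATE, DEFECT = 0, 1


def new_q_table(num_neighbors):
    """One row per state 0C..kC, one column per action; zeros are assumed."""
    return [[0.0, 0.0] for _ in range(num_neighbors + 1)]


def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Eq. 6: move Q(state, action) toward reward + gamma * max_a Q(next_state, a)."""
    best_next = max(q_table[next_state])
    td_error = reward + gamma * best_next - q_table[state][action]
    q_table[state][action] += alpha * td_error


def epsilon_greedy(q_table, state, epsilon, rng):
    """Explore uniformly with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice([COOPERATE, DEFECT])
    best = max(q_table[state])
    # Ties between equally good actions are broken at random.
    candidates = [a for a, value in enumerate(q_table[state]) if value == best]
    return rng.choice(candidates)
```

Here `reward` stands for the memory-based payoff of Eq. 2, and `state` and `next_state` are the numbers of cooperating neighbors before and after the round.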

II.4 Stationary distribution of the number of learners and profiteers

As we mentioned before, individuals are allowed to switch between profiteers and learners, with the corresponding transition matrix shown in Eq. 4. Since the transition probabilities do not depend on the time step, the category of an individual can be viewed as a two-state homogeneous Markov chain $\{X_n, n=0,1,2,\cdots\}$ with state space $E=\{Profiteer, Learner\}$, where $X_n$ denotes the category of each individual at time $n$. Moreover, for any $i,j\in E$, there always exists a positive integer $n_0$ such that $B_{i,j}^{n_0}>0$ (this holds, for instance, whenever $0<p,q<1$), indicating the ergodicity of the Markov chain. By the ergodic theorem, we have:

$$\begin{cases}(\pi_1,\pi_2)=(\pi_1,\pi_2)B,\\ \sum_{i=1}^{2}\pi_i=1,\end{cases} \tag{7}$$

where $(\pi_1,\pi_2)$ indicates the stationary distribution of the Markov chain $\{X_n, n=0,1,2,\cdots\}$. Substituting Eq. 4 into Eq. 7, we obtain the solution $\pi_1=q/(p+q)$ and $\pi_2=p/(p+q)$, which implies that as evolutionary time extends indefinitely, the probability of an individual being a learner is $\pi_1$, while the probability of being a profiteer is $\pi_2$.
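For completeness, the solution can be checked by writing out the first component of Eq. 7 with the matrix $B$ of Eq. 4, assuming $p+q>0$:

```latex
\begin{align*}
\pi_1 &= (1-p)\,\pi_1 + q\,\pi_2
  \;\Longrightarrow\; p\,\pi_1 = q\,\pi_2, \\
\pi_1 + \pi_2 &= 1
  \;\Longrightarrow\; \pi_1 = \frac{q}{p+q}, \qquad \pi_2 = \frac{p}{p+q}.
\end{align*}
```

As a purely illustrative numerical case (values not taken from the simulations), $p=0.2$ and $q=0.3$ give $\pi_1=0.6$ and $\pi_2=0.4$.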

Furthermore, since whether each individual is a profiteer or a learner is independent of the other individuals, i.e., the categories of individuals are independent of each other, we obtain the expected number of profiteers and learners in the network when the evolution stabilizes, as expressed in the following equation:

$$\begin{cases}E_{learner}=N\times\pi_1=\frac{Nq}{p+q},\\ E_{profiteer}=N\times\pi_2=\frac{Np}{p+q},\end{cases} \tag{8}$$

where $N$ represents the size of the network.
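The closed-form expectations of Eq. 8 can be cross-checked numerically by iterating the distribution $\pi \leftarrow \pi B$ until it stops changing. The sketch below assumes an all-learner starting distribution and uses hypothetical function names; the parameter values in the usage note are illustrative only.

```python
def stationary_by_iteration(p, q, steps=1000):
    """Iterate pi <- pi B (Eq. 4), starting from an all-learner distribution.

    Returns the pair (probability of learner, probability of profiteer).
    """
    learner, profiteer = 1.0, 0.0
    for _ in range(steps):
        learner, profiteer = (learner * (1.0 - p) + profiteer * q,
                              learner * p + profiteer * (1.0 - q))
    return learner, profiteer


def expected_counts(n, p, q):
    """Eq. 8: expected numbers of learners and profiteers in a network of size n."""
    return n * q / (p + q), n * p / (p + q)
```

For instance, with `p = 0.3` and `q = 0.2` the iteration converges to approximately `(0.4, 0.6)`, in agreement with `expected_counts(2500, 0.3, 0.2)`, which gives about 1000 learners and 1500 profiteers.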

To provide a clearer understanding of our model, we present an illustrative example of the model in Fig. 1. Each individual possesses a probability of transitioning between being a profiteer and a learner, showcasing dynamic interactions between the two categories in the network. Profiteers and learners utilize the Fermi rule and Q-learning to update their strategies, respectively. Notably, each learner maintains a Q-table. Taking individual $v_5$ highlighted within the red square as an example, it is surrounded by two cooperators and two defectors among its neighbors. According to the Q-table, the Q-value of adopting a cooperative strategy in the current state is 17.72, surpassing the Q-value (13.92) for choosing a defective strategy. Consequently, individual $v_5$ will opt for the cooperative strategy during exploitation at the next moment.

Figure 1: An illustration of the model. Profiteers and learners interact dynamically in the network, with the two categories of individuals adopting different strategy updating rules. Besides, the category of each individual changes from learner (resp. profiteer) to profiteer (resp. learner) with the probability $p$ (resp. $q$).

III Simulation results and analysis

In this section, we aim to validate the impact of the proposed model on the evolution of cooperative behavior through numerical simulations and provide an analysis of the results. We conduct simulations on two categories of networks: (i) regular square lattice (SL) with periodic boundary conditions and von Neumann neighborhood with the network size of $N=50\times 50$; (ii) Watts-Strogatz small-world network (WS) with $N=2500$, where each individual is initially connected to its 2 nearest neighbors on the left and right, and with a reconnection probability of 0.2. Initially, each individual is assigned to defect or cooperate with all its neighbors by a coin toss. Subsequently, individuals update their strategies using the reinforcement learning method called $\epsilon$-greedy Q-learning or the Fermi rule based on their categories, and the probability that a learner explores when updating its strategy is set to $\epsilon=0.1$. We primarily focus on the evolution between profiteers and learners, the emergence of cooperative behaviors, and a microscopic view of the distribution of cooperators and defectors. In addition, we validate the robustness of the model by performing simulations on networks with different sizes. According to our simulations, the evolution of the cooperation frequency is stable after 1000 iterations. Therefore, in all simulations, we average the last 500 of the entire 5000 time steps to obtain the result of one simulation. Additionally, to mitigate interference from other factors, we perform 20 independent simulations and take their average as the final outcome.
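The lattice setup described above can be sketched as follows. The helper names are hypothetical, strategies are encoded as 0 (cooperate) and 1 (defect), and the coin toss is assumed to be fair; the Watts-Strogatz network is not covered by this sketch.

```python
import random


def von_neumann_neighbors(row, col, size):
    """Four nearest neighbors on a size x size lattice with periodic boundaries."""
    return [((row - 1) % size, col),   # up
            ((row + 1) % size, col),   # down
            (row, (col - 1) % size),   # left
            (row, (col + 1) % size)]   # right


def random_initial_strategies(size, rng):
    """Assign 0 (cooperate) or 1 (defect) to each site with equal probability."""
    return [[rng.randrange(2) for _ in range(size)] for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    grid = random_initial_strategies(50, rng)
    cooperators = sum(row.count(0) for row in grid)
    print("initial cooperation frequency:", cooperators / 2500)
```

The modulo operation implements the periodic boundary, so a corner site such as `(0, 0)` still has exactly four neighbors.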

III.1 Evolution and statistics of the number of profiteers and learners

To verify the theory of dynamic interaction between learners and profiteers proposed in this paper, we first plot the evolution of the number of learners over time under different transition probabilities $p$ and $q$. The result is shown in Fig. 2(a), where the red straight line indicates the theoretical value calculated according to Eq. 8. We do not plot the evolution of profiteers since it can be obtained by subtracting the number of learners from the size of the network. From Fig. 2(a), we can see that the number of learners in all three cases stabilizes around $t = 100$ and subsequently fluctuates around a value that closely aligns with the red theoretical line. Additionally, Tab. 1 lists the theoretical and simulated numbers of learners and profiteers, along with the relative error between them, for the three scenarios, which quantifies the gap between theory and simulation. The theoretical value is calculated using Eq. 8, while the simulated value is obtained by averaging the last 1000 steps of the evolution curve in Fig. 2(a). The relative error is determined by $e = |x - x^{*}| / x^{*}$, where $x^{*}$ and $x$ denote the theoretical and simulated values, respectively. The theoretical and simulated results are very close in all three situations, with a maximum relative error of only 0.096%, which supports the correctness of our theoretical derivation. Furthermore, by comparing the numbers of learners and profiteers in the stable state under the three cases, we observe that increasing $p$ leads to more profiteers, while a larger $q$ results in more learners in the network, which is consistent with the results shown in Fig. 2(a).
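The comparison above can be reproduced in a small sketch. Eq. 8 is not restated in this section, so the code assumes the standard stationary distribution of a two-state Markov chain, in which the expected number of learners is $N q / (p + q)$; this assumption reproduces the theoretical values in Tab. 1. The function names and the 50/50 initial assignment are hypothetical.

```python
import random


def stationary_learners(n, p, q):
    """Expected number of learners at stationarity (assumed form of Eq. 8)."""
    return n * q / (p + q)


def simulate_learner_counts(n, p, q, steps, seed=0):
    """Simulate n independent two-state chains and record the learner count per step.

    A learner becomes a profiteer with probability p;
    a profiteer becomes a learner with probability q.
    """
    rng = random.Random(seed)
    is_learner = [rng.random() < 0.5 for _ in range(n)]  # assumed initial mix
    counts = []
    for _ in range(steps):
        for i in range(n):
            if is_learner[i]:
                if rng.random() < p:
                    is_learner[i] = False
            elif rng.random() < q:
                is_learner[i] = True
        counts.append(sum(is_learner))
    return counts


def relative_error(simulated, theoretical):
    """e = |x - x*| / x*, with x* the theoretical value."""
    return abs(simulated - theoretical) / theoretical


if __name__ == "__main__":
    theory = stationary_learners(2500, 0.8, 0.5)
    counts = simulate_learner_counts(2500, 0.8, 0.5, steps=300)
    simulated = sum(counts[-200:]) / 200
    print(theory, simulated, relative_error(simulated, theory))
```

Because each individual switches independently, the learner count at stationarity fluctuates around this mean, which is the behavior seen in Fig. 2(a).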

Figure 2: Evolutionary curves and statistical distributions of the number of learners under different transition probabilities. Panels: (a) evolutionary curves; (b) statistical distributions. The green squares represent the results under $p=0.5, q=0.5$, while the blue diamonds and black triangles stand for the results under $p=0.5, q=0.8$ and $p=0.8, q=0.5$, respectively, each differing from the green case in only one parameter. The red line indicates the theoretical result derived from Eq. 8. In (a), the $x$-axis is the evolutionary time and the $y$-axis is the number of learners. In (b), the $x$-axis represents the number of learners, while the $y$-axis shows the corresponding probability. The number of learners becomes stable around $t=100$, and the numerical simulation results agree with the theory.

Subsequently, we record the number of learners in the last 4000 steps of each scenario and plot its probability distribution under the three sets of transition probabilities, treating relative frequencies as probabilities in the sense of the law of large numbers. The result is shown in Fig. 2(b), where the red straight line represents the theoretical values. Similarly, we present some numerical characteristics of each of the three distributions in Tab. 2, including standard deviation, skewness, and kurtosis. The standard deviation is calculated using the `std()` method of the `numpy` library in Python, while the skewness and kurtosis are obtained with the `skew()` and `kurt()` methods of the `pandas` library, respectively. We observe that all three distributions are approximately normal and that each spans a relatively narrow range, which explains the small standard deviations reported in Tab. 2. Moreover, the red theoretical line passes through the peak of each distribution, which supports our theory in another way. The skewness of all three distributions of the number of learners in Tab. 2 is less than 0, i.e., the distributions are left-skewed, with fewer data located to the left of the mean than to the right, whereas the reverse holds for profiteers. The kurtosis of learners under both $p=0.8, q=0.5$ and $p=0.5, q=0.5$ is less than 0, which means that the distribution is flatter than the normal distribution (platykurtic), while the kurtosis under $p=0.5, q=0.8$ is greater than 0, which implies that the distribution is more peaked than the normal distribution (leptokurtic). It is worth noting that the standard deviation and kurtosis are the same for profiteers and learners, while the skewness has the opposite sign. This arises because the network contains only two categories of individuals, so the number of profiteers equals $N$ minus the number of learners; the standard deviation and kurtosis are built from even-order central moments, which are unchanged under this reflection, whereas the skewness is built from an odd-order central moment, which changes sign.
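The symmetry between learners and profiteers can be checked directly. Since the number of profiteers is $N$ minus the number of learners, the $k$-th central moments satisfy $\mu_k(N - X) = (-1)^k \mu_k(X)$. The sketch below uses plain population moments and hypothetical learner counts. It is not the `numpy`/`pandas` computation used for Tab. 2: `pandas` applies sample-size corrections in `skew()` and `kurt()`, so the numbers differ slightly, but the sign relations are the same.

```python
def central_moment(xs, k):
    """k-th central moment (population form, dividing by len(xs))."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def std(xs):
    """Population standard deviation, from the second central moment."""
    return central_moment(xs, 2) ** 0.5


def skewness(xs):
    """Third standardized moment (odd order, so it flips sign under reflection)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (even order, so it is unchanged)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0


N = 2500
learners = [950, 961, 975, 940, 968, 959, 990, 962]  # hypothetical counts
profiteers = [N - x for x in learners]

print(std(learners), std(profiteers))                          # equal
print(skewness(learners), skewness(profiteers))                # opposite signs
print(excess_kurtosis(learners), excess_kurtosis(profiteers))  # equal
```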

Table 1: Comparison of theoretical and simulated results for the number of learners and profiteers

| Results | Learner ($p=0.8, q=0.5$) | Profiteer ($p=0.8, q=0.5$) | Learner ($p=0.5, q=0.5$) | Profiteer ($p=0.5, q=0.5$) | Learner ($p=0.5, q=0.8$) | Profiteer ($p=0.5, q=0.8$) |
|---|---|---|---|---|---|---|
| Theoretical values | 961.538 | 1538.462 | 1250 | 1250 | 1538.462 | 961.538 |
| Simulated values | 961.379 | 1538.621 | 1248.799 | 1251.201 | 1538.403 | 961.597 |
| Relative error | 0.017% | 0.010% | 0.096% | 0.096% | 0.004% | 0.006% |

Table 2: Numerical characteristics of the distribution of the number of learners and profiteers

| Results | Learner ($p=0.8, q=0.5$) | Profiteer ($p=0.8, q=0.5$) | Learner ($p=0.5, q=0.5$) | Profiteer ($p=0.5, q=0.5$) | Learner ($p=0.5, q=0.8$) | Profiteer ($p=0.5, q=0.8$) |
|---|---|---|---|---|---|---|
| Standard deviation | 24.148 | 24.148 | 25.905 | 25.905 | 24.999 | 24.999 |
| Skewness | -0.038 | 0.038 | -0.021 | 0.021 | -0.081 | 0.081 |
| Kurtosis | -0.144 | -0.144 | -0.223 | -0.223 | 0.451 | 0.451 |

III.2 Emergence and evolution of cooperative behavior

In this subsection, we investigate the effect of several model parameters on the emergence and evolution of cooperative behaviors. First, we present the heat map of the cooperation ratio with respect to the transition probabilities $p$ and $q$ between profiteers and learners, as shown in Fig. 3. The $x$-axis is $q$, the transition probability from profiteers to learners, and the $y$-axis is $p$, the transition probability from learners to profiteers, both with the range [0, 1]. Both networks show that when $p$ is fixed, increasing the transition probability $q$ leads to a rise in the proportion of cooperators, i.e., the incorporation of learners promotes the emergence of cooperative behavior compared to a population of pure profiteers. This insight offers a fresh perspective on the widespread cooperative behavior in real-world scenarios. Although many individuals behave as profiteers by imitating the strategies of individuals with high payoffs, there are still learners who learn from their own past experiences, and the existence of these learners further raises the proportion of cooperators in the population.

Figure 3: Heat maps of cooperation ratio regarding transition probabilities $p$ and $q$. Panels: (a) SL network; (b) WS network. This figure shows the impact of the transition probabilities $p$ ($y$-axis) and $q$ ($x$-axis) between profiteers and learners on cooperative behavior on the (a) SL and (b) WS networks. The cost-to-benefit ratio of the SDG, the memory decay factor, and the memory length on both networks are fixed to $r=0.6$, $\beta=0.5$, and $M=5$, respectively. These heat maps visualize how varying $p$ and $q$ affects cooperative dynamics within different network topologies. We observe that the inclusion of learners boosts the frequency of cooperators compared to pure profiteers.

Subsequently, we examine the impact of the memory mechanism proposed in this paper on the evolution of cooperative behavior. We set the transition probabilities between profiteers and learners to $p=q=0.5$, so that, according to Eq. 8, profiteers and learners are present in equal numbers in the network. In this case, we plot the heat map of the cooperation ratio with respect to the memory length and the memory decay factor. The results for the SL and WS networks are shown in Figs. 4(a) and 4(b), respectively, from which we find that increasing either the memory length $M$ or the memory decay factor $\beta$ decreases the number of cooperators on both networks.

Figure 4: Heat maps of cooperation frequency regarding memory length and memory decay factor under the coexistence of profiteers and learners. Panels: (a) SL network; (b) WS network. In this figure, we illustrate the influence of the memory decay factor $\beta$ and the memory length $M$ on the frequency of cooperators on the SL (panel (a)) and WS (panel (b)) networks under the coexistence of profiteers and learners. We set the transition probabilities between learners and profiteers to $p=q=0.5$, ensuring a balanced presence of learners and profiteers in the network. Additionally, the payoff parameter of the SDG is set to $r=0.3$. These heat maps show how different combinations of memory parameters influence cooperative behavior in networks with different structures. Both networks show that increasing the memory length and the memory decay factor inhibits cooperative behavior.

In addition, Fig. 5 shows how the proportion of cooperators depends on the memory mechanism in the pure profiteer scenario, where the probability of converting from learner to profiteer is set to $p=1$ and the probability of changing from profiteer to learner is $q=0$. All other parameters are the same as in the numerical simulation in Fig. 4. The memory mechanism greatly facilitates the emergence of cooperative behavior in the case of pure profiteers. Notably, pure cooperation even appears on both networks when the memory length and memory decay factor are relatively large. Furthermore, upon comparing Figs. 5(a) and 5(b), we observe that the region of pure cooperation on the WS network in Fig. 5(b) is significantly larger than that on the SL network in Fig. 5(a), which suggests that the WS network is more favorable to the survival of cooperators than the SL network. Taken together with Fig. 4, where the same memory parameters reduce cooperation once learners are present, we conclude that while the memory mechanism facilitates the emergence of cooperative behavior in profiteers, it inhibits the evolution of cooperation in learners.

Figure 5: Heat maps of cooperation frequency regarding memory length and memory decay factor in the case of pure profiteers. Panels: (a) SL network; (b) WS network. This figure presents the influence of the memory decay factor $\beta$ ($y$-axis) and the memory length $M$ ($x$-axis) on cooperative behavior on the SL (subplot (a)) and WS (subplot (b)) networks under conditions of pure profiteers. Here, the transition probabilities between learners and profiteers are fixed to $p=1$ and $q=0$, so that all individuals in the network are profiteers and no learners exist. Furthermore, the payoff parameter of the SDG is set to $r=0.3$. These heat maps show how memory parameters influence cooperative behavior in networks consisting of profiteers only. Memory mechanisms favor the survival of cooperators in groups of pure profiteers.

Figure 6: Heat maps of cooperation percentage with respect to learning rate and discount factor in Q-learning. Panels: (a) SL network; (b) WS network. In this figure, we depict the impact of the learning rate $\alpha$ and the discount factor $\gamma$ on the ratio of cooperators on the SL (panel (a)) and WS (panel (b)) networks. The $x$-axis and $y$-axis in each subgraph are $\gamma$ and $\alpha$, respectively, each with the range [0, 1]. These heat maps show how different combinations of learning rates and discount factors affect cooperative behavior in the two types of networks. A lower learning rate and a larger discount factor cause more defectors to invade the cooperators.

Figure 7: Heat maps of cooperation ratio concerning exploration rate and payoff parameter. Panels: (a) SL network; (b) WS network. In this figure, we study the influence of the exploration rate $\epsilon$ and the payoff parameter $r$ on cooperative behavior within the SL (panel (a)) and WS (panel (b)) networks. The $x$-axis and $y$-axis in each subfigure represent $r$ and $\epsilon$, respectively, each with the range [0, 1]. These visualizations show how variations in $\epsilon$ and $r$ affect cooperative dynamics on the SL and WS networks. An appropriate exploration rate promotes cooperative behavior when $r$ is small, while increasing the exploration rate inhibits the maintenance of cooperators when $r$ is large.

It is also crucial to study the effect of the learners' learning rate $\alpha$ and discount factor $\gamma$ in Eq. 6 on cooperative behavior. Thus, we show heat maps of the percentage of cooperators on the SL and WS networks with respect to $\alpha$ and $\gamma$ in Figs. 6(a) and 6(b), respectively. We set the payoff parameter, the transition probabilities between profiteers and learners, the memory decay factor, and the memory length to $r=0.5$, $p=q=0.5$, $\beta=0.5$, and $M=5$. From Figs. 6(a) and 6(b), we find that on both SL and WS networks a higher learning rate $\alpha$ leads to a larger percentage of cooperators. On the contrary, increasing the discount factor $\gamma$ decreases the cooperation frequency. This indicates that a larger learning rate and a greater tendency of learners toward immediate benefits promote the emergence and maintenance of cooperative behaviors in the network.
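To make the roles of $\alpha$ and $\gamma$ concrete, the following is a minimal sketch of one tabular Q-learning update. Eq. 6 is not restated in this section, so the code assumes the standard Q-learning form; the choice of `'C'` and `'D'` as both states and actions, and the function name, are hypothetical.

```python
def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """One standard Q-learning step (assumed form of Eq. 6).

    Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))
    """
    best_next = max(q_table[next_state].values())
    old_value = q_table[state][action]
    q_table[state][action] = (1 - alpha) * old_value + alpha * (reward + gamma * best_next)
    return q_table[state][action]


# Hypothetical Q-table: states and actions are both cooperate ('C') or defect ('D').
q_table = {s: {"C": 0.0, "D": 0.0} for s in ("C", "D")}
q_update(q_table, "C", "C", reward=1.0, next_state="C", alpha=0.5, gamma=0.9)
```

With $\gamma = 0$ the update depends only on the immediate reward, which corresponds to the "tendency toward immediate benefits" discussed above, while a larger $\alpha$ makes each new payoff overwrite more of the old estimate.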

Figure 8: Evolutionary curves of cooperation frequency and corresponding snapshots under different memory decay factors and payoff parameters. Subfigure (a) shows the evolution of the cooperation frequency $f_c$ over time for three distinct combinations of memory decay factor $\beta$ and payoff parameter $r$: (0.9, 0.3), (0.1, 0.5), and (0.9, 0.5). The snapshots of individual strategy distributions on the SL networks at $t = 1, 10, 100$, and 1000 are depicted in (b)-(e) for $(\beta, r)$ = (0.9, 0.3), in (f)-(i) for $(\beta, r)$ = (0.9, 0.5), and in (j)-(m) for $(\beta, r)$ = (0.1, 0.5). The memory length is fixed at $M=10$, and the transition probabilities between learners and profiteers are set to $p=1$ and $q=0$. The proportion of cooperators stabilizes around $t=100$, suggesting that cooperators reach a stable state in the evolutionary process. Besides, cooperators resist the invasion of defectors primarily by forming clusters.

Next, we explore the influence of the exploration rate $\epsilon$ and the payoff parameter $r$ on cooperative behavior, and the results are depicted in Fig. 7. The transition probabilities between profiteers and learners are set to $p=0.4, q=0.8$, and the memory mechanism parameters are held constant at $\beta=0.5$ and $M=5$. For relatively small values of the payoff parameter ($r<0.3$), an appropriate increase in the exploration rate fosters cooperation on both the WS and SL networks, but excessive exploration hampers cooperative behavior. However, when $r$ is relatively large ($r>0.3$), raising the exploration rate consistently diminishes the prevalence of cooperators. Furthermore, we observe that an increase in the payoff parameter always reduces the fraction of cooperators, as higher $r$ values lead to greater payoffs for defectors according to Eq. 1, thereby incentivizing more individuals to adopt the defective strategy.
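The role of the exploration rate can be illustrated with a short sketch of $\epsilon$-greedy action selection: with probability $\epsilon$ a learner picks an action uniformly at random, and otherwise it picks the action with the largest Q-value. The function name and the `'C'`/`'D'` action labels are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action from one Q-table row {action: value}.

    Explore (uniform random action) with probability epsilon,
    otherwise exploit (action with the largest Q-value).
    """
    if rng.random() < epsilon:
        return rng.choice(sorted(q_row))
    return max(q_row, key=q_row.get)


# With epsilon = 0.1, about one update in ten is exploratory.
action = epsilon_greedy({"C": 0.4, "D": 0.7}, epsilon=0.1)
```

At $\epsilon = 0$ the learner always exploits its current estimates, and at $\epsilon = 1$ its choice is purely random, so the heat maps in Fig. 7 interpolate between these two extremes.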

III.3 Snapshots of the evolution of cooperators on SL networks

Then, we display the evolutionary curves of the cooperation ratio under three different parameter pairs ($\beta$, $r$) and illustrate the distribution of cooperators and defectors on the SL networks at different instants from a micro perspective in Fig. 8. From Fig. 8(a), the frequency of cooperators $f_c$ settles around $t=100$ in all three situations and thereafter fluctuates slightly around a fixed value. In addition, we observe that the cooperation ratio for $\beta=0.9, r=0.5$, marked by blue squares, is higher than that for $\beta=0.1, r=0.5$, labeled by green triangles, and lower than that for $\beta=0.9, r=0.3$, marked by black circles. This can be visually confirmed from the snapshots in Figs. 8(b)-(m), where red represents defectors and blue indicates cooperators.

Specifically, Figs. 8(b), (f), and (j) all show that cooperators and defectors are almost evenly mixed in the network at $t=1$. For $\beta=0.9, r=0.3$ (Figs. 8(b)-(e)), the blue region grows as time progresses and eventually occupies the entire network, indicating that all individuals become cooperators and no defectors remain. For $\beta=0.9, r=0.5$ (Figs. 8(f)-(i)), the blue region gradually increases and eventually stabilizes, signifying that cooperators play a dominant role. In contrast, for $\beta=0.1, r=0.5$ (Figs. 8(j)-(m)), the blue region gradually diminishes over time and eventually stabilizes, meaning that defectors hold a dominant role. Additionally, we observe that cooperators resist the invasion of defectors mainly by forming clusters. These observations can be attributed to the fact that the transition probabilities between profiteers and learners are set to $p=1$ and $q=0$, i.e., the network is ultimately occupied entirely by profiteers, for which increasing the memory decay factor $\beta$ promotes cooperative behaviors, as indicated in Fig. 5. Besides, according to Eq. 1, when a cooperator meets a defector, the defector's payoff increases but the cooperator's payoff decreases as the payoff parameter $r$ grows, which promotes the emergence and maintenance of defectors in the network.
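The dependence on $r$ described above can be made explicit with a small sketch. Eq. 1 is not restated in this section, so the code assumes a common one-parameter normalization of the snowdrift game (mutual cooperation 1, temptation $1+r$, sucker's payoff $1-r$, mutual defection 0), which agrees with the stated monotonicity but may differ from the paper's exact matrix. The function name is hypothetical.

```python
def sdg_payoff(my_strategy, other_strategy, r):
    """Payoff of the focal player in one snowdrift-game round.

    Assumed normalization (not taken from Eq. 1):
    C vs C -> 1, C vs D -> 1 - r, D vs C -> 1 + r, D vs D -> 0.
    """
    if my_strategy == "C":
        return 1.0 if other_strategy == "C" else 1.0 - r
    return 1.0 + r if other_strategy == "C" else 0.0


# As r grows, the defector gains and the cooperator loses in a C-D encounter.
for r in (0.25, 0.5):
    print(r, sdg_payoff("D", "C", r), sdg_payoff("C", "D", r))
```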

III.4 Validation of the robustness of the model

Note that in all of our previous simulations, the sizes of the SL and WS networks are fixed to $N = 50 \times 50$. In this subsection, we examine the evolution of cooperative behaviors on SL and WS networks of different sizes to verify the robustness of the model. We depict the variation of the cooperation frequency $f_c$ on SL and WS networks as a function of network size $N$ for three different sets of parameters ($M$, $\beta$) and three distinct groups of parameters ($r$, $p$) in Figs. 9(a) and 9(b), respectively.

From Fig. 9, we find that on both SL and WS networks the network size has almost no influence on cooperative behavior under the same set of parameters. This is further supported by the range and standard deviation of each curve, shown in Tabs. 3 and 4. We choose these two metrics because the range indicates the extent of fluctuation in the proportion of cooperators, while the standard deviation quantifies its typical magnitude. A small range and standard deviation signify that network size has a negligible impact on the frequency of cooperators, validating the robustness of our findings across different network sizes. The maximum range and standard deviation on the SL network do not exceed 0.0402 and 0.0113, respectively, and those on the WS network are within 0.0324 and 0.0089, respectively, which signifies that the fluctuations of $f_c$ on both networks are minimal. Notably, these maxima are smaller on the WS network than on the SL network, consistent with the trends presented in Fig. 9. Furthermore, by comparing Figs. 9(a) and 9(b), we notice that the WS network exhibits a higher percentage of cooperators for the parameter set ($M$, $\beta$) (dashed lines) than the SL network, while both networks show similar cooperation ratios for the parameter set ($r$, $p$) (solid lines). Therefore, we conclude that the results are consistent across different network sizes for SL and WS networks, confirming the robustness of the proposed model.
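The two robustness metrics are simple to compute from one curve of $f_c$ against network size. The sketch below uses the population standard deviation from Python's `statistics` module; whether the values in Tabs. 3 and 4 use the population or sample form is not stated, so this choice is an assumption, and the example values are hypothetical.

```python
import statistics


def value_range(values):
    """Range of a curve: largest minus smallest cooperation frequency."""
    return max(values) - min(values)


def fluctuation_summary(values):
    """Return (range, population standard deviation) for one f_c-versus-N curve."""
    return value_range(values), statistics.pstdev(values)


# Hypothetical cooperation frequencies measured at several network sizes.
f_c_by_size = [0.612, 0.615, 0.610, 0.613, 0.611]
print(fluctuation_summary(f_c_by_size))
```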

Figure 9: Variation in the proportion of cooperators with network size under different conditions. Panels: (a) SL network; (b) WS network. This figure illustrates the dependence of the frequency of cooperators on network size under different parameter pairs ($M$, $\beta$) and ($r$, $p$) on the SL (subplot (a)) and WS (subplot (b)) networks. For the ($M$, $\beta$) scenario, we set the transition probabilities between learners and profiteers to $p=1$ and $q=0$, and the payoff parameter to $r=0.5$. In the ($r$, $p$) scenario, the transition probability from profiteer to learner is set to $q=0.5$, and the memory mechanism parameters are set to $\beta=0.6$ and $M=10$. The fluctuation range and amplitude of the cooperation frequency with respect to network size are minimal, which supports the robustness of the model.

Table 3: The range and standard deviation of SL network

| Results | $M=5, \beta=0.3$ | $M=10, \beta=0.6$ | $M=15, \beta=0.9$ | $r=0.2, p=0.3$ | $r=0.5, p=0.6$ | $r=0.8, p=0.9$ |
|---|---|---|---|---|---|---|
| Range | 0.0036 | 0.0037 | 0.0402 | 0.0017 | 0.0107 | 0.0054 |
| Standard deviation | 0.0010 | 0.0011 | 0.0113 | 0.0004 | 0.0031 | 0.0016 |

Table 4: The range and standard deviation of WS network

| Results | $M=5, \beta=0.3$ | $M=10, \beta=0.6$ | $M=15, \beta=0.9$ | $r=0.2, p=0.3$ | $r=0.5, p=0.6$ | $r=0.8, p=0.9$ |
|---|---|---|---|---|---|---|
| Range | 0.0222 | 0.0324 | 0.0020 | 0.0049 | 0.0053 | 0.0050 |
| Standard deviation | 0.0062 | 0.0089 | 0.0006 | 0.0013 | 0.0014 | 0.0014 |

IV Conclusion and outlook

In this study, we investigate the evolution of cooperation on SL and WS networks with dynamic interactions between learners and profiteers, modeling the category of each individual as a homogeneous discrete-time Markov chain with two states. Different categories of individuals update their strategies according to different rules: learners adopt Q-learning, while profiteers follow the Fermi rule. Additionally, we introduce the memory decay factor and memory length so that individuals compute payoffs from a broader history rather than from the current round of the game alone. In the simulations, we begin by plotting the evolutionary curve and statistical distribution of the number of learners and verify the theoretical analysis through various numerical characteristics. Subsequently, we perform numerous simulations of the effect of the proposed model on cooperative behavior on the SL and WS networks. We find that dynamic interactions between profiteers and learners promote cooperation, that increasing the learning rate and decreasing the discount factor in Q-learning increase the proportion of cooperators in the network, and that memory mechanisms enhance the emergence of cooperation in pure profiteer groups. Then, we present snapshots of the evolution of cooperators on SL networks and focus on the formation and evolution of cooperative clusters from a micro perspective. We discover that cooperators resist the invasion of defectors mainly by forming cluster structures, with smaller payoff parameters and larger memory decay factors leading to larger clusters. Furthermore, our simulations on networks of varying sizes demonstrate the robustness of the model, revealing that network size has a negligible impact on the cooperation ratio on both SL and WS networks.

Building on our work, several extensions merit further exploration in spatial evolutionary games. For example, we mainly focus on the dynamic interactions between profiteers and learners on static networks, whereas many structured networks change dynamically over time [holme2012temporal; li2020evolution]. These networks feature evolving interactions among individuals, presenting an intriguing opportunity to extend our model to temporal networks. In addition to complete interactions, more and more scholars have recently turned their attention to stochastic and incomplete games [li2021evolution; li2022impact; wang2021evolution], which are also worth considering in our future research. Moreover, some previous studies have pointed out that real-world scenarios often involve conformist behaviors [pi2022evolutionary; szolnoki2015conformity]. Therefore, future research could introduce conformists and assess the impact of dynamic interactions among the three categories of individuals on the maintenance and evolution of cooperative behavior.


This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 62206230 and No. 12271083 and in part by the Natural Science Foundation of Sichuan Province under Grant No. 2022NSFSC0501.



