
The computational roots of positivity and confirmation biases in reinforcement learning

Trends in Cognitive Sciences, 2022
Humans do not integrate new information objectively: outcomes carrying a positive affective value and evidence confirming one’s own prior belief are overweighed. Until recently, theoretical and empirical accounts of the positivity and confirmation biases assumed them to be specific to ‘high-level’ belief updates. We present evidence against this account. Learning rates in reinforcement learning (RL) tasks, estimated across different contexts and species, generally present the same characteristic asymmetry, suggesting that belief and value updating processes share key computational principles and distortions. This bias generates over-optimistic expectations about the probability of making the right choices and, consequently, generates over-optimistic reward expectations. We discuss the normative and neurobiological roots of these RL biases and their position within the greater picture of behavioral decision-making theories.
Stefano Palminteri 1,2,3,* and Maël Lebreton 4,5,6,*

1 Laboratoire de Neurosciences Cognitives et Computationnelles, Institut National de la Santé et de la Recherche Médicale, Paris, France
2 Département d’Études Cognitives, Ecole Normale Supérieure, Paris, France
3 Université de Recherche Paris Sciences et Lettres, Paris, France
4 Paris School of Economics, Paris, France
5 LabNIC, Department of Fundamental Neurosciences, University of Geneva, Geneva, Switzerland
6 Swiss Center for Affective Science, Geneva, Switzerland
*Correspondence: stefano.palminteri@ens.fr (S. Palminteri) and mael.lebreton@pse.fr (M. Lebreton)
https://doi.org/10.1016/j.tics.2022.04.005 © 2022 Elsevier Ltd. All rights reserved.

Highlights
Human belief updating is pervaded by distortions, such as positivity and confirmation bias.
Experimental evidence from a variety of tasks, collected in different mammal species, suggests that these biases also exist in simple reinforcement learning (RL) contexts.
Confirmatory RL generates over-optimistic reward expectations and aberrant preferred response rates.
Counter-intuitively, confirmatory RL exhibits statistical advantages over unbiased RL in a variety of learning contexts.
Confirmatory RL may contribute to diverse and apparently unrelated behavioral phenomena, such as stickiness to the status quo, overconfidence, and the persistence of (pathological) gambling.

From belief updating to reinforcement learning
Our decisions critically depend on the beliefs we have about the options available to us: their probability of occurrence, conditional on the actions that we undertake, and their value, that is, how good they are. It is therefore not surprising that an ever-growing literature in cognitive psychology and behavioral economics focuses on how humans form and update their beliefs. While Bayesian inference principles provide a normative solution for how beliefs can be optimally updated when we receive new information, in humans, belief-updating behaviors often deviate from this normative benchmark. Among the most prominent systematic deviations, the positivity and the confirmation biases (see Glossary) stand out for their pervasiveness and ecological relevance [1].

The positivity bias characterizes the fact that decision-makers tend to update their beliefs more when new evidence conveys a positive valence [1,2]. This bias has notoriously been revealed in situations where subjects learn something about themselves and preferentially integrate information that conveys a positive signal [e.g., a higher intelligence quotient (IQ) or a lower risk of disease] [3–5]. The confirmation bias characterizes the fact that decision-makers tend to update their beliefs more when new evidence confirms their prior beliefs and past decisions compared with when it disconfirms or contradicts them [1,6]. This bias can take many forms, extending to positive test strategies and selective information sampling, and it has been robustly reported in a variety of natural and laboratory experimental setups [6,7].

Of note, in most ecological settings, positivity and confirmatory biases co-occur [8,9]. Indeed, unless a cogent experimental design carefully orthogonalizes them, we typically hold opinions and select actions that we believe have a positive subjective value (e.g., a higher payoff in economic settings). Therefore, after a better than expected outcome, such actions result in a positive and confirmatory update [3,10,11].

To date, the dominant framework used to explain the existence and persistence of asymmetric belief updating posits that it stems from a ‘rational’ cost-benefit trade-off. The cost of holding objectively wrong (i.e., overly optimistic) beliefs is traded against the psychological benefits of them being self-serving: believing in a world that is pleasant and reassuring per se (consumption value) [2,12–15]. While originally designed to account for the positivity bias, this logic arguably extends to the confirmation bias, when one considers that being right (as signaled by confirmatory information) is also valuable and self-serving (‘the ego utility consequences of being right’ [3]). Importantly, both original and more recent versions of this theoretical account of asymmetric belief updating explicitly suggest that this class of learning biases is specific to high-level and ego-relevant beliefs [2,12,14,15], a position that seems supported by the fact that the positivity bias (as opposed to the confirmatory bias) does not clearly extend to belief updates that are not ego-relevant, such as in purely financial contexts [10,16–19].

In the present article, we review recent empirical and modeling studies that challenge the standard account, and suggest that the asymmetries that affect high-level belief updates are shared with more elementary forms of updates. This set of empirical findings cannot be purely explained by the dominant, self-serving bias account of asymmetric updates and shows that some forms of positivity and confirmatory biases occur across a wide variety of species and contexts.

Testing asymmetric updating in the reinforcement-learning framework
Arguably, the RL framework represents the ideal elementary form of motivated belief updating. RL characterizes the behavioral processes that consist of selecting among alternative courses of action, based on inferred economic (or affective) values that are learned by interacting with the environment [20]. In addition to being computationally simple, elegant, and tractable, the most popular RL algorithms can solve (or be a core component of the solution to) higher-level cognitive tasks, such as spatial navigation, games involving strategic interactions, and even complex video games, thereby constituting an ideal basic building block for higher-level cognitive processes [21,22].
The basic experimental framework of a two-armed bandit task (often referred to as a two-alternative forced-choice task) provides all the key elements necessary to assess the pervasiveness of positivity and confirmation bias. In this simplified set-up, the decision-maker faces two neutral cues, associated with different reward distributions (Figure 1A). In the most popular RL formalism, the decision-maker learns, through an error-correction mechanism, to attach subjective values [Q(:)] to each option, which they use to make later choices (Figure 1B).

Concretely, once an option is chosen (‘c’), the decision-maker receives an outcome R. The outcome is compared with its subjective value, generating a prediction error:

$PE(c) = R(c) - Q(c)$   [1]

The prediction error is then used to update the subjective value of the chosen option via an error-correction mechanism involving a weighting parameter, the learning rate:

$Q(c) \leftarrow Q(c) + \alpha \cdot PE(c)$   [2]

Reframed in terms of belief updating, the magnitude of the prediction error quantifies how surprising the experienced outcome is, while its sign (positive or negative) specifies the valence of the information carried by the experienced outcome. In other words, positive prediction errors follow outcomes that are better than expected (i.e., they signal relative gains or good news), while negative prediction errors follow outcomes that are worse than expected (i.e., they signal relative losses or bad news). In addition, a positive prediction error following the chosen option confirms that the decision-maker was right to pick the current course of action (and the converse is true for a negative prediction error). In theory, it is possible to define two different learning rates, following these two types of prediction errors:

$Q(c) \leftarrow Q(c) + \begin{cases} \alpha^{+} \cdot PE(c) & \text{if } PE(c) > 0 \\ \alpha^{-} \cdot PE(c) & \text{if } PE(c) < 0 \end{cases}$   [3]

As a consequence, in this simplified experimental and computational framework, an elementary counterpart of both the positivity and confirmation bias should be reflected in a learning rate asymmetry, that is, in the fact that positive learning rates (α+) are higher than negative ones (α–). In the following sections, we review evidence in favor of (or challenging) the hypothesis that updating biases analogous to the positivity bias and confirmation bias occur in simple RL tasks.

Glossary
Belief-confirmation bias: the tendency to overweight or selectively sample information that confirms our own beliefs (‘what I believe is true’). Also referred to as prior-biased updating, belief perseverance, or conservatism, among other nomenclatures.
Bias: a feature of a cognitive process that introduces systematic deviations between the state of the world and an internal representation.
Choice-confirmation bias: the tendency to overweight information that confirms our own choice (‘what I did was right’).
Learning rate: a model parameter that traditionally indexes the extent to which prediction errors affect future expectations.
Model comparison: a collection of methods aimed at determining the best model in a given dataset, combining model fitting and model simulations to assess, respectively, the falsifiability of the rejected models and the parsimony of the accepted one.
Model fitting: a statistical method aimed at estimating the values of model parameters that maximize the likelihood of observing the empirical data. Model fitting is not to be confounded with model comparison.
Positivity bias: the tendency to overweight events with a positive affective valence. In the specific context of RL, it would consist in overweighting positive prediction errors (regardless of whether they are associated with the chosen or forgone option). Positivity bias is also sometimes referred to as the good news-bad news effect or preference-biased updating.
Prediction error: the discrepancy between an expectation and reality. In the context of RL, prediction errors are defined as the difference between an expected and an obtained outcome and they therefore have a valence: they are positive when the outcome is better than expected, and they are negative when the outcome is worse than expected.

Figure 1. Typical behavioral task and computational reinforcement learning framework. (A) A typical trial of a two-armed bandit task. Both the partial and complete feedback conditions are presented. Labels in black indicate the objective steps of the trial, while labels in gray indicate the corresponding hidden cognitive processes. (B) Box-and-arrow representation of a reinforcement learning model of a two-armed bandit task. The figure presents a complete feedback task, where both the obtained [i.e., following the chosen option: R(c)] and forgone [i.e., following the unchosen option: R(u)] outcomes are displayed. The figure also presents a ‘full’ model with a learning rate specific to each combination of prediction error (PE) valence (positive ‘+’ or negative ‘–’) and relation to choice (chosen ‘c’ or unchosen ‘u’) [45,46]. (C) Illustration of how the learning rates of the full model (i.e., a model with a different learning rate for any possible combination of outcome types and prediction error valences) relate to those of the confirmation bias model, which bundles together the learning rates for positive obtained and negative forgone (i.e., confirmatory, ‘CON’) prediction errors and the learning rates for negative obtained and positive forgone (i.e., disconfirmatory, ‘DIS’) prediction errors.

Value update biases in reinforcement learning
Positivity bias in reinforcement learning
About 15 years ago, a few studies incidentally started fitting variants of the Q-learning model to human data collected in simple RL tasks [23–27]. Notably, they fitted Q-learning models with separate learning rates depending on prediction error valence [Q(α±)]. Comparisons between the two learning rates generally revealed a positivity bias (α+ > α–), although sometimes results were mixed across groups or learning phases.

Arguably, a strong demonstration of a positivity bias requires three steps, which were usually absent in these incidental observations: first, the Q(α±) model should outperform the standard model with one learning rate Q(α) in a stringent model comparison [24]; second, although allowed to vary across individuals, the comparison of the two learning rates estimated from model fitting should reveal a significant asymmetry on average, such as α+ > α–; third, behavioral data should exhibit at least one qualitative pattern that falsifies the standard model while being explained by the Q(α±) model (see [28] and Box 1 for a survey of the behavioral signatures of the positivity bias). These three levels of demonstration were unambiguously achieved in a recent study investigating asymmetric updating in a simple two-armed bandit task in humans [29].
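To make the update rules in equations 1–3 concrete, the following is a minimal Python sketch of a Q(α±) learner on a two-armed bandit with partial feedback. The reward probabilities, trial count, softmax temperature, and learning rate values are illustrative assumptions, not parameters taken from any of the studies discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta=3.0):
    """Softmax choice probabilities over Q values (beta = inverse temperature)."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

def run_bandit(alpha_pos, alpha_neg, p_reward=(0.25, 0.75), n_trials=100):
    """Q(alpha±) learner on a two-armed bandit with partial feedback.

    alpha_pos weights positive prediction errors and alpha_neg negative ones
    (equations [1]-[3]); alpha_pos > alpha_neg implements a positivity bias.
    """
    q = np.zeros(2)                              # initial option values
    for _ in range(n_trials):
        c = rng.choice(2, p=softmax(q))          # choose an option
        r = float(rng.random() < p_reward[c])    # outcome of the chosen option only
        pe = r - q[c]                            # prediction error, equation [1]
        alpha = alpha_pos if pe > 0 else alpha_neg
        q[c] += alpha * pe                       # valence-dependent update, equation [3]
    return q

print("positivity-biased:", run_bandit(alpha_pos=0.6, alpha_neg=0.2))
print("unbiased:         ", run_bandit(alpha_pos=0.4, alpha_neg=0.4))
```

The valence-dependent branch is the only departure from standard Q-learning; everything else (softmax choice, error-correction update) is the conventional formalism.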
The fact that individuals update the option values more following positive rather than negative prediction errors leads to an optimistic overestimation of reward expectations and a heightened probability of selecting what the decision-maker believes is the best option. Importantly, the key aspects of such optimistic RL were later replicated in fully incentivized experiments, which included various types of outcome ranges, such as gain (+0.5€/0.0€), loss (0.0€/–0.5€), and mixed contexts (+0.5€/–0.5€) [29,30]. These results confirm that negative prediction errors are downweighted relative to positive prediction errors even when they are associated with actual monetary losses. Moreover, the positivity bias cannot be neutralized, nor reverted, by either increasing the saliency of negative outcomes (0.0€ ➔ –0.5€) or decreasing the saliency of positive outcomes (+0.5€ ➔ 0.0€). Finally, this also tells us that the bias depends on the valence (or sign) of the prediction error and not the outcome.

Generalizing the results
Since then, several other studies featuring different experimental designs also fitted the Q(α±) model, thus putting learning asymmetry to the test. In a task featuring different regimens of outcome uncertainty, learning rates are typically adaptively modulated as a function of this environmental volatility: learning rates in a volatile condition are higher than those in a stable condition [31]. In addition to this adaptive modulation, a positivity bias can be observed in human participants in both the low- and high-volatility conditions [32]. When the same volatility task features an ‘appetitive’ treatment (winning money vs. nothing) and an ‘aversive’ treatment (getting a mild electric shock vs. nothing), a positivity bias is reported in human participants in all treatments (rewarding and aversive) and conditions (stable and volatile) (Figure 2A) [33].

The positivity bias in learning rates has been found beyond two-armed bandit task contexts, such as in foraging situations [34], in multi-attribute RL (e.g., as instantiated by the Wisconsin Card Sorting Test [35]), in strategic interactions and multistep decisions with delayed rewards [36], and in learning transitivity relations [37]. These results suggest that the positivity bias is robust to major variations of experimental protocols, from uncertainty about the outcomes (stable vs. volatile) to differences in the nature of the outcomes themselves (e.g., primary, like electric shocks, or secondary, like money) and the extension of the state-transition structure of the task beyond two-armed bandits.

It is worth noting, however, that on some occasions, studies failed to find a positivity bias or even reported a negativity bias (α+ < α–) [38–43]. We argue that sources of such inconsistencies could sometimes be found in specific choices concerning model specification that can hinder the identification of a positivity bias (Box 2). Other features of the design, such as mixing instrumental (or ‘free’) choices and Pavlovian (or ‘forced’) trials, may also have blurred the result (see the following section).

Figure 2. Reinforcement learning biases across tasks, species, and outcome types. (A) The panel displays learning rates from Gagne et al. [33] plotted as a function of the nature of the outcomes used in the task (appetitive/money vs. aversive/electric shocks), the volatility of option-outcome contingencies (stable vs. volatile, as in [31]), and the prediction error valence (positive ‘+’ vs. negative ‘–’). (B and C) The panels display learning rates from Farashahi et al. [32] (B) and Ohta et al. [54] (C) plotted as a function of the species (monkeys vs. rats), the volatility of option-outcome contingencies, and the prediction error valence (positive ‘+’ vs. negative ‘–’). (D and E) The panels display the choice-confirmation bias. The figure displays the learning rates from Chambon et al. [45] (experiment 2 in the paper) of a full model (i.e., a model with a different learning rate for any possible combination of choices, outcomes, and prediction error types) as a function of whether the outcome followed a free (or instrumental) or a forced (or observational) trial; whether the outcome was associated with the obtained or forgone option; and, finally, the valence of the prediction error (positive ‘+’ or negative ‘–’). The overall pattern is consistent with a choice-confirmation bias because positive obtained and negative forgone prediction errors are overweighed only if they follow a free choice (D), but not after a forced choice [observational trial; (E)]. Data visualization is as in [111]: horizontal lines represent the mean; the error bars represent the error of the mean; the box, the 95% confidence interval. Finally, the colored area is the distribution of the individual points.

Box 1. Behavioral signatures of positivity-biased update
Here, we illustrate some behavioral signatures that have been associated with the positivity bias in standard reinforcement learning paradigms, focusing on two-armed bandit tasks with partial feedback (see Figure 1A in main text).

A first signature associated with the positivity bias is reported in ‘stable’ bandits (i.e., situations where the option probabilities and values do not change), specifically in situations where there is no correct option [23]. In such situations, the positivity bias predicts the development of a preferred response rate to a much greater extent compared with the other learning rate patterns (Figure IA). Another signature has been uncovered in ‘reversal’ bandits, that is, tasks where after some time the best option becomes the worst and vice versa. In these situations, the positivity bias first generates a high correct response rate before reversal, then induces a reluctance to switch toward the alternative option in the second phase (post reversal) [62–64] (Figure IB). Both the development of a higher preferred response rate and the reluctance to reverse can be broadly understood as manifestations of the fact that the positivity bias induces choice inertia. Here, the feedback that is supposed to make us change our policy is not taken into account [112].

A third signature of the positivity bias, independent from the choice inertia phenomenon, comes from bandits designed to assess risk preferences by contrasting a risky option (i.e., the option with a variable outcome) to a safe one with a similar expected value (Figure IC). Crucially, in these kinds of bandits, the alternative patterns of learning rates (unbiased, α+ = α–; and negativity bias, α+ < α–) predict subjects to behave in a risk-avoidant manner. Although prima facie counterintuitive, this result can be understood by considering that outcome sampling can locally generate a negative expectation for the risky option, which may never be corrected (with partial feedback). The positivity bias predicts a certain degree of risk-seeking behavior: a pattern that has often been observed in humans [55,113] (albeit sometimes in interaction with the valence of the decision frame) and frequently in non-human primates [90].

Finally, by inducing an overestimation of reward expectations, both positivity and choice-confirmation biases mechanistically overestimate the subjective probability of making a correct choice. Not only is this prediction weakly confirmed by the observation of widespread patterns of overconfidence in reinforcement learning tasks, but recent results also suggest that individual levels of overconfidence and confirmatory learning are correlated [47,48].

Figure I. Behavioral signatures of biased updates. The panels display the two-armed bandit task contingencies (top) and simulated choice rates (bottom) as a function of the trial number with three different models (unbiased α+ = α–, positivity bias α+ > α–, and negativity bias α+ < α–). (A) Two-armed bandit task with stable contingencies and no correct response (top) and preferred choice rate (bottom). The preferred choice rate is defined as the choice rate of the option most frequently selected by the simulated subject, by definition in more than 50% of trials [29,46]. (B) Reversal learning task (top) and correct choice rate (bottom). (C) Risk preference task (top) and risky choice rate (bottom). The curves are obtained by simulating the corresponding models using a very broad range of parameter values. For each task (‘stable’, ‘reversal’, and ‘risk’) and model (‘unbiased’, ‘positivity’, and ‘negativity’), we simulated 10 000 agents; decisions were implemented using a softmax decision rule. The parameters were drawn from uniform distributions covering all possible values of learning rates to ensure the generality of the results. See github.com/spalminteri/valence_bias_simulations for full details.
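As a rough, hedged counterpart to the simulations summarized in Figure I (the repository cited above contains the full scripts), the sketch below estimates the preferred response rate in a ‘stable’ bandit with no correct option, where both options pay off 50% of the time, for unbiased, positivity-biased, and negativity-biased learners. Agent counts, trial numbers, learning rates, and the softmax temperature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(q, beta=5.0):
    """Softmax choice probabilities over Q values."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

def preferred_rate(alpha_pos, alpha_neg, n_trials=100, n_agents=1000):
    """Average preferred response rate in a stable 50%/50% two-armed bandit.

    The preferred response rate is the choice rate of whichever option the
    simulated agent ends up selecting most often (>= 50% by definition).
    """
    rates = np.empty(n_agents)
    for a in range(n_agents):
        q = np.zeros(2)
        choices = np.empty(n_trials, dtype=int)
        for t in range(n_trials):
            c = rng.choice(2, p=softmax(q))
            r = float(rng.random() < 0.5)        # both options rewarded 50% of the time
            pe = r - q[c]
            q[c] += (alpha_pos if pe > 0 else alpha_neg) * pe
            choices[t] = c
        p1 = choices.mean()                      # proportion of choices of option 1
        rates[a] = max(p1, 1 - p1)
    return rates.mean()

for label, ap, an in [("unbiased  ", 0.4, 0.4),
                      ("positivity", 0.6, 0.2),
                      ("negativity", 0.2, 0.6)]:
    print(label, round(preferred_rate(ap, an), 3))
```

Under these assumptions, the positivity-biased learner typically develops the strongest preferred response rate, which is the choice-inertia signature described in Box 1.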
Box 2. Misidentifying asymmetric update
Assessing reinforcement learning update biases relies on estimating learning rates from choice data. Although the logic of such inference is intuitive, fitting and interpreting parameters remain some of the trickiest analytical steps in computational cognitive modeling [28,100,114]. Here, we discuss how the estimation of learning rate asymmetries can be affected by (or mistaken for) apparently neutral choices of model specification and the omission of alternative computational processes.

For instance, although counterintuitive, Q value initialization markedly affects learning rate and learning bias estimates. The reason is that the first prediction error plays a very important role in shaping all subsequent responses, especially in designs involving a small number of trials and stable contingencies. For instance, pessimistic initializations (i.e., setting initial Q values lower than the true default expectation) can counter, or reverse, genuine positivity biases by artificially amplifying the size of the first positive prediction error. Consequently, it is not surprising that many of the papers reporting a negativity bias used pessimistic initialization [38–40], although not all (see [59]). Since the effect of priors vanishes after a few trials and in volatile environments, tasks featuring long learning phases and variable contingencies are particularly well suited to tease apart pessimistic initializations from positivity and confirmation biases [33].

It has also recently been proposed that positivity and confirmation biases may spuriously arise when fitting different learning rates to models including an explicit choice-autocorrelation term [104,112]. The choice-autocorrelation term is usually modeled as a (fixed or graded) bias in the choice function toward the option that was previously chosen and is thought to account for the development of a habitual process [115]. Intuitively, both processes naturally lead to a similar escalation of choice repetition, as successful learning increasingly identifies the best option (Box 1). Yet a crucial, conceptual difference is that the autocorrelation is independent of the outcome (i.e., of the prediction error). A recent meta-analysis showed that in nine datasets the choice-confirmation bias is still detectable despite the inclusion of a choice-autocorrelation term [103]. It can be further argued that in the context of a typically short learning task (less than 1 h), developing a strong outcome-independent habit is unlikely. As a consequence, it is possible that studies fitting explicit choice autocorrelation actually missed occurrences of positivity and confirmation bias [27,116–118]. Tasks contrasting riskier and safer options can tease apart these competing accounts, because only the positivity and confirmation biases predict a preference for the riskier (high variance) options (Box 1) [55,90].

From positivity to confirmatory bias
The studies surveyed so far all feature what is often referred to as partial feedback conditions, that is, the standard situation where the subject is informed only about the outcome of the chosen option (Figure 1A and [44]). Critically, under this standard set-up, it is not possible to assess whether the reported positivity bias actually reflects a saliency bias (‘all positive prediction errors are overweighed’) or a choice-confirmation bias (‘only positive prediction errors following obtained outcomes are overweighed’). To tease apart these interpretations, we conducted a series of studies leveraging complete feedback conditions, which consist of also displaying the forgone (or counterfactual) outcome, that is, the outcome associated with the unchosen option in a two-armed bandit task [45,46].

Under the saliency bias hypothesis, one expects larger learning rates for positive prediction errors, independent of them being associated with the chosen or unchosen option. Under the confirmation bias hypothesis, one expects an interaction between the valence of the prediction error and its association with the chosen or the unchosen option (Figure 1B). The rationale is that a better-than-expected forgone outcome can be interpreted as a relative loss, as it indicates that the alternative course of action could have been beneficial (a disconfirmatory signal). Symmetrically, a worse-than-expected forgone outcome can be interpreted as a relative gain, as it indicates that the current course of action is advantageous (a confirmatory signal). In a recent study that explicitly and systematically exploited this rationale, we observed the interaction characterizing the confirmation bias hypothesis: positive and negative learning rates associated with the unchosen option mirrored the learning rates associated with the chosen option (Figure 2D).
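The trial-level logic of the choice-confirmation model sketched in Figure 1B,C can be written compactly: under complete feedback, better-than-expected obtained outcomes and worse-than-expected forgone outcomes are both confirmatory and share one learning rate, while the two remaining cases are disconfirmatory. The following Python fragment is a minimal illustration under assumed outcome probabilities and parameter values, not the fitting code used in the cited studies.

```python
import numpy as np

rng = np.random.default_rng(2)

def confirmatory_update(q, c, r_chosen, r_forgone, a_con=0.5, a_dis=0.1):
    """One complete-feedback trial of the two-learning-rate (CON/DIS) model.

    q         : Q values for the two options (modified in place)
    c         : index of the chosen option (1 - c is the forgone option)
    r_chosen  : obtained outcome
    r_forgone : forgone (counterfactual) outcome
    """
    u = 1 - c
    pe_c = r_chosen - q[c]                         # obtained prediction error
    pe_u = r_forgone - q[u]                        # forgone prediction error
    q[c] += (a_con if pe_c > 0 else a_dis) * pe_c  # positive obtained PE = confirmatory
    q[u] += (a_con if pe_u < 0 else a_dis) * pe_u  # negative forgone PE = confirmatory
    return q

p = (0.4, 0.6)                                     # assumed reward probabilities
q = np.zeros(2)
for _ in range(200):
    # epsilon-greedy choices, for illustration only
    c = int(np.argmax(q)) if rng.random() > 0.1 else int(rng.integers(2))
    u = 1 - c
    r_c = float(rng.random() < p[c])               # obtained outcome
    r_u = float(rng.random() < p[u])               # forgone outcome
    q = confirmatory_update(q, c, r_c, r_u)
print(q)
```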
Additional model comparison analyses showed that the four-learning-rate model could be reduced to a two-learning-rate model, featuring a single parameter for all confirmatory and disconfirmatory feedback, respectively (Figure 1C). The symmetrical pattern of learning rates, as well as the superiority of this implementation of the choice-confirmation bias against other models, has been replicated several times in RL tasks that include both partial and complete feedback information [47–49].

In a follow-up study that further investigated the choice-related aspects of the positivity bias, standard instrumental trials were interleaved with observational trials, where participants observed the computer making a choice for them and the resulting outcome [45]. Results from model fitting and model comparison indicated that the update bias was specific to freely chosen outcomes, further corroborating the presence of a proper choice-confirmation bias (Figure 2E). Importantly, the fact that agency seems mandatory to observe the choice-confirmation bias [45,50] is reminiscent of the ego-relevance aspect of belief-updating biases.

Finally, several studies have experimentally manipulated participants’ beliefs about the option values through task instructions (e.g., by explicitly indicating option values to the participants before the beginning of the experiment) [51,52]. Behavioral results in this task are consistent with a model that assumes that the usual learning asymmetry is further exacerbated by the (instructed) prior about the option value, such that positive prediction errors following options with a positive prior are overweighted (and the reverse is true for options with a negative prior). Therefore, the available evidence is consistent with the idea that belief-confirmation bias can be induced in the context of RL via semantic instructions, thus suggesting a permeability between cognitive representations and instrumental associations.

Figure 3. Optimality of the learning rate biases. The figure displays the simulation results recently reported in Lefebvre et al. [62]. Performance of the model is expressed as the average reward per trial obtained by the artificial agents and is indexed by a colored gradient, so that yellow represents the highest values. Artificial agents are simulated playing a two-armed bandit task, using an exhaustive range of model parameters (learning rates) and across different task conditions. ‘Partial feedback’ refers to simulations where only the feedback of the chosen outcome is disclosed to the agent, while ‘complete feedback’ refers to simulations where both the obtained and forgone outcomes are disclosed to the agents. ‘Rich task’ refers to simulations in which both options have an overall positive expected value, while ‘poor task’ indicates the opposite configuration. ‘Stable task’ refers to simulations featuring a good option (positive expected value) and a bad option (negative expected value), whose values do not change across time. On the contrary, ‘volatile task’ refers to simulations in which the options switched from good to bad (and vice versa) three times during the learning period. Performance is plotted as a function of the learning rates. Cells above the diagonal correspond to a positivity bias (‘partial feedback’) or a confirmation bias (‘complete feedback’). The cell with a black circle indicates the best possible unbiased (or symmetric) combination of learning rates (in terms of average reward per trial). Cells surrounded by black lines indicate the biased (or asymmetric) combinations of learning rates that obtain a higher reward rate compared with the best unbiased combination (see the original paper for more details; adapted with permission from [62]).

Positivity and confirmation biases across evolution and development
A valuable aspect of RL tasks in general (and n-armed bandits in particular) is that they are routinely used in non-human research, opening up the possibility of testing the comparative validity of the positivity bias results. To our knowledge, to date, few studies have tested the dual-learning-rate model in other species. Among those few, one study featuring stable and volatile phases tested both humans and rhesus monkeys (Macaca mulatta) with the same task [32]. Like humans, monkeys displayed a positivity bias, whose size was, if anything, larger than that observed in humans (Figure 2B and Box 1 for possible behavioral consequences). A couple of recent studies in rodents (Rattus norvegicus) also provide support for the positivity bias [53,54]. In addition, they suggest that the bias could be modulated by factors such as the stage of learning (the bias being larger in the exploratory phase) and the overall value of the decision problem (the bias being larger in ‘poor’ environments) (Figure 2C).

Regarding the developmental aspects of positivity and confirmation bias, a series of recent studies investigated learning behavior in a simple two-armed bandit task in cohorts including children and young adults. While most of these studies actually report a positivity bias in all age groups [55–58] (but see [59]), they draw conflicting conclusions regarding the developmental trajectories of the bias. Further studies are therefore required to better assess the trajectory of these biases during development and aging, as well as to identify the individual traits and tendencies that promote or counteract them.

Is confirmatory updating a flaw or a desirable feature of reinforcement learning?
The presence of update biases (such as the positivity and the confirmation bias) in basic RL across species and contexts naturally raises the question of why evolution has selected and maintained what can be perceived, prima facie, as error-introducing processes that generate apparently irrational behavioral tendencies (Box 1).

Statistical normativity of choice-confirmation bias
Early simulations restricted to specific task contingencies and partial feedback regimens demonstrated that a positivity bias is optimal in learning contexts with a low overall reward rate (‘poor’ environments) but detrimental in learning contexts with a high overall reward rate (‘rich’ environments) [60]. This result can be intuitively understood as a consequence of the fact that, in partial feedback situations, it is rational to preferentially take into account the prediction errors that are rare (i.e., positive prediction errors in ‘poor’ environments and negative prediction errors in ‘rich’ environments) (Figure 3A). However, to date, experimental data have not provided convincing evidence in favor of an inversion of the learning bias as a function of task demands [39,45] (but see [55] for a partial adaptation).
Accordingly, a positivity bias following partial feedback is maintained in tasks involving contingency reversals and volatility [33,46], even though this reduces the learner’s capacity to quickly adapt their responses in these conditions (Box 1). However, the fact that the positivity bias appears maladaptive in some (laboratory-based) conditions does not rule out the possibility that it has been selected and maintained by evolution because it could still be adaptive in most ecologically relevant scenarios [61]. Indeed, the fact that the bias is documented in several species suggests that its statistical advantages should apply across a broad range of ecological contexts.

A recent study systematically analyzed the performance of the choice-confirmation bias in complete feedback contexts to clarify its statistical properties. Specifically, the study assessed its optimality in a larger space of learning problems, including ‘rich’ and ‘poor’, ‘stable’, and ‘volatile’ environments, as well as more demanding decision problems [62]. The authors reported that confirmatory-biased RL algorithms generally outperform their unbiased counterparts (Figure 3B). This counterintuitive result, replicated by other simulation studies, arises from the fact that confirmatory RL algorithms mechanistically neglect uninformative, stochastic, negative prediction errors associated with the best response. Thereby, they accumulate resources (i.e., collect rewards and avoid losses) more efficiently than their unbiased counterparts [62–64]. Thus, confirmatory updating appears to facilitate and optimize learning and performance in a broad range of learning situations [61,65].

Metacognitive efficiency potentiates the positivity bias
Finally, positivity and confirmatory biases may be normative or advantageous in combination with other features of cognition. Supporting this idea, recent work proposes that learning biases are normative when coupled with efficient metacognition [66]. This is because when one can efficiently tease apart one’s own correct decisions from one’s mistakes, the probabilistic negative feedback (that sometimes inevitably follows correct choices) can be neglected. This creates a normative ground for positivity and confirmation biases. Note that this mechanism might not be restricted to humans, as efficient metacognition has been reported in animals, from non-human primates to rodents [67,68].

A challenge to this idea lies in the fact that learning biases and metacognitive (in)efficiencies might not be independent. Indeed, a yet unpublished study shows that in a two-armed bandit task where confidence in choice is elicited, the confirmation bias can cause overconfidence, which is a metacognitive bias [48]. While these findings challenge the idea that metacognition ensures that updating biases are normative, they might connect the asymmetric updating observed in RL to the original theoretical accounts of asymmetric belief updating, if overconfidence (i.e., the metacognitive illusion of accuracy) is considered self-serving per se, that is, carries an ego-relevant utility [15,69].
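A stripped-down sketch of the kind of performance comparison reported in [62] and Figure 3 is shown below: many agents play a two-armed bandit, and the average reward per trial of a confirmatory (asymmetric) learner is compared with that of an unbiased learner. Reward probabilities, learning rates, trial and agent counts, and the restriction to a single stable condition are illustrative assumptions; the original study sweeps the full grid of learning rates across rich/poor, stable/volatile, and partial/complete feedback conditions.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(q, beta=5.0):
    """Softmax choice probabilities over Q values."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

def average_reward(a_con, a_dis, p=(0.4, 0.6), n_trials=100, n_agents=2000):
    """Average reward per trial of a (possibly confirmatory) complete-feedback learner."""
    total = 0.0
    for _ in range(n_agents):
        q = np.zeros(2)
        for _ in range(n_trials):
            c = rng.choice(2, p=softmax(q))
            u = 1 - c
            r_c = float(rng.random() < p[c])                # obtained outcome
            r_u = float(rng.random() < p[u])                # forgone outcome
            pe_c, pe_u = r_c - q[c], r_u - q[u]
            q[c] += (a_con if pe_c > 0 else a_dis) * pe_c
            q[u] += (a_con if pe_u < 0 else a_dis) * pe_u   # negative forgone PE is confirmatory
            total += r_c
    return total / (n_trials * n_agents)

print("unbiased     :", round(average_reward(a_con=0.3, a_dis=0.3), 3))
print("confirmatory :", round(average_reward(a_con=0.5, a_dis=0.1), 3))
```

Under many such settings, the confirmatory learner matches or exceeds the unbiased one because it discounts the stochastic negative prediction errors that occasionally follow choices of the objectively better option.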
In conclusion, although this section reviewed the evidence that learning asymmetry may be normative in some contexts, and as such may provide justification for its selection in those contexts, its persistence in contexts where it is unfavorable, along with its lack of modulation in many circumstances, reinforces the idea that learning asymmetry constitutes a hardcoded learning bias [39,45,54,55]. A complementary perspective on the normativity of this bias could emerge from different modeling perspectives. For example, a recent unpublished study suggests that asymmetric updating can be derived from Bayesian-optimal principles [70].

Neuronal bases
Neural circuits for biased updating
An important question concerns the neurobiological bases of positivity and confirmatory bias in RL [71]. A prerequisite to answering this question is a consensus concerning the neural bases of RL per se. The dominant hypothesis, stemming from repeated and robust electrophysiological and pharmacological observations, postulates that reinforcement is instantiated by dopaminergic modulation of corticostriatal synapses [72–75]. A neural model of biased (or asymmetric) updates then further requires that the neural channels for positive and negative prediction errors are dissociable. In line with this assumption, anatomically plausible neural network models of corticostriatal circuits suggest that positive and negative reinforcements are mediated by specific subpopulations of striatal neurons, which exhibit different receptors with excitatory (D1) or inhibitory (D2) properties [76]. These models (as well as their more recent developments [77,78]) can therefore support, in principle, asymmetric updating, by implementing the processing of positive and negative reinforcements in different neurobiological pathways. Crucially, recent extensions of these models also account for the absence of the bias following observational trials and its exacerbation induced by instruction priors [50,51]. A conceptually similar but structurally different neural network model put forward an alternative theory, which suggests a key computational role for metaplasticity in the generation of update biases [79]. While the metaplasticity framework does not necessitate the emergence of a positivity bias, this bias naturally emerges under most outcome contingencies, confirming its advantageous properties [62–64,80].

Neural signatures in human studies
Several lines of evidence suggest that the neurotransmitter dopamine and a basal ganglia structure, the striatum, govern the relative sensitivity to positive and negative prediction errors. First, in both healthy individuals and neurological patients, dopaminergic modulation affects the learning rate bias, such that higher dopamine is associated with a higher positivity bias [81–84]. Second, in healthy subjects, interindividual differences in positivity bias are associated with higher striatal activation in response to rewards [29]. Interindividual differences in the positivity bias have also been associated with pupil dilation (another physiological proxy of neuromodulator activity) during outcome presentation in classic two-armed bandit tasks [85].
Finally, the choice-confirmation bias model supposes that positive and negative prediction errors associated, respectively, with obtained and forgone outcomes are treated by the same learning rate, as confirmatory signals. fMRI studies of two-armed bandit tasks with complete feedback (Figure 1A) confirm that obtained and forgone outcome signals are both encoded in the dopaminergic striatum, with opposite signs, thereby suggesting that the neurocomputational role currently attributed to this structure can be extended to accommodate the choice-confirmation bias without major structural changes [86,87].

Loss aversion versus loss neglect
Overall, the studies reviewed here suggest that in RL, outcomes are processed in a choice-confirmatory manner. This bias takes the form of a selective neglect of losses (i.e., obtained punishments and forgone rewards) relative to gains (i.e., obtained rewards and forgone punishments) when updating outcome expectations. Superficially, this pattern seems in stark contrast with a vast literature in behavioral economics revolving around the notion of loss aversion [88]. According to loss aversion, prospective losses loom larger than corresponding gains in determining individuals’ economic choices [89]. In the RL framework, this valuation asymmetry would directly translate into the negative prediction error having a larger relative influence on value expectation. Consequently, the choice-confirmation bias observed in RL does not align, at least prima facie, with dominant behavioral economics theories, potentially representing an additional instance of the experience-description gap [44,90] (Figure 4).

However, a more in-depth consideration of the processes at stake may help reconcile these apparently contradictory findings. First, loss aversion pertains to the calculation of subjective decision values, while loss neglect, in the context of RL, applies to the retrospective subjective assessment of experienced outcomes. It is well known that different heuristics and biases apply to expected and experienced utilities [91,92]. Second, most of the findings reviewed here, although properly incentivized, use relatively small outcomes (primary or secondary). Evidence in behavioral science and economics suggests that the utility function may display specific features in the range of small amounts usually involved in RL studies, making them unsuited to test, and to challenge, the general structure of loss aversion [93,94] (but note that some recent studies claim that loss aversion also extends to small outcomes [95]). Finally, it is worth noting that prospective loss aversion and retrospective loss neglect, although superficially antithetic, provide complementary explanations for the status quo bias. While loss aversion would explain the bias by the fear of losing current assets [94,96,97], loss neglect rather posits that we disregard the feedback that suggests we made a wrong decision (Box 1). Retrospective loss neglect (or choice-confirmation bias), however, provides a putative new computational explanation for the puzzling phenomenon of (pathological) gambling, which is difficult to accommodate with loss aversion (see Figure IC in Box 1) [98,99].

Figure 4. Loss aversion versus loss neglect. This figure exemplifies the crucial computational differences between ‘loss aversion’ and ‘loss neglect’. The former applies to decisions between explicit (or described) options, often also referred to as ‘prospects’ and experimentally instantiated by lotteries. The latter applies to decisions between options (often experimentally instantiated by bandits) whose values have been learned by trial and error (or experience). In the former case, the slope in the loss domain, which determines the relation between subjective and objective values, corresponds to the loss aversion parameter. In the latter case, the slopes in the positive and negative domains determine the extent to which an option estimate (Q value) is updated as a function of the prediction error; the slopes correspond to the learning rates for positive and negative prediction errors, respectively.
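To make the computational contrast drawn in Figure 4 explicit, the toy functions below separate the two stages: loss aversion enters prospectively, as an asymmetric weighting of described gains and losses when a decision value is computed, whereas loss neglect enters retrospectively, as asymmetric learning rates applied to prediction errors during value updating. The functional forms and parameter values (a lambda of 2, learning rates of 0.6 and 0.2) are illustrative assumptions.

```python
def prospect_value(x, lam=2.0):
    """Loss aversion (decision stage): a described loss is weighted lam times
    more heavily than an equivalent gain when valuing a prospect."""
    return x if x >= 0 else lam * x

def biased_update(q, r, a_pos=0.6, a_neg=0.2):
    """Loss neglect (learning stage): negative prediction errors are weighted
    less than positive ones when an experienced outcome updates a Q value."""
    pe = r - q
    return q + (a_pos if pe > 0 else a_neg) * pe

# Prospectively, a -1 outcome looms larger than a +1 outcome...
print(prospect_value(-1.0), prospect_value(+1.0))          # -2.0 vs. 1.0
# ...retrospectively, the same -1 outcome moves the value estimate less than +1 does.
print(biased_update(0.0, -1.0), biased_update(0.0, +1.0))  # -0.2 vs. 0.6
```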
Concluding remarks
The evidence reviewed here suggests that, contrary to what was previously thought [2,69], positivity and confirmation biases permeate RL, leading to an over-optimistic estimation of outcome expectations. This results in characteristic behavioral consequences (Box 1) that may explain phenomena such as choice inertia (or status quo bias) and risky decision-making (gambling).

Empirical investigations of the choice-confirmation bias in RL have mostly relied on inferring model parameters from choice data. Therefore, no matter how carefully this inferential process is carried out [100], it is still conceivable that a surrogate, spurious computational process is responsible for the observed patterns of behavioral and neurobiological results. While we believe the current competing interpretations are not supported by available experimental evidence (Box 2 and [101–106]), future research should carefully combine model fitting and clever designs to provide unambiguous evidence for the neurocomputational mechanisms of positivity and confirmatory biases [28].
Empirical investigations of the choice-confirmation bias in RL have mostly relied on inferring model parameters from choice data. Therefore, no matter how carefully this inferential process is carried out [100], it remains conceivable that a surrogate, spurious computational process is responsible for the observed patterns of behavioral and neurobiological results. While we believe the current competing interpretations are not supported by the available experimental evidence (Box 2 and [101–106]), future research should carefully combine model fitting and clever designs to provide unambiguous evidence for the neurocomputational mechanisms of positivity and confirmatory biases [28].

Recently, a stream of studies from cognitive (neuro)science has described behavioral patterns consistent with this emerging account of positivity and confirmatory biases. Indeed, confirmation bias has been described in a simple perceptual task [107,108] and within the time-evolving dynamics of the decision [109]. Crucially, in this latter case, the act of choosing was critical to the expression of the bias [110]. These findings suggest that confirmation bias is not purely a reflection of a high-level reasoning bias, nor restricted to the domain of abstract, semantic beliefs.

In sum, a growing body of empirical studies in humans and animals reveals that the asymmetries that affect high-level belief updates are shared with more elementary forms of updates, notably in the form of the choice-confirmation bias observed in RL. Whether those update asymmetries are caused by shared neurocomputational mechanisms, or whether they have emerged independently in two separate pathways, remains an open question (see Outstanding questions). Finally, at the conceptual level, important links between the concepts of agency, metacognition, and ego-relevance could help reconcile fundamental aspects of belief and value update asymmetries.

Outstanding questions
What are the boundaries and limits of positivity and choice-confirmation bias? Important questions remain, pertaining, for example, to the persistence of those biases in learning contexts where outcome distributions are not binomial (such as continuous outcomes – drifting bandits), or when outcome distributions include (very) high monetary stakes. Likewise, the existence and potential signatures of those biases in complex (multistep, multi-attribute) tasks remain to be investigated.
What are the precise computational mechanisms underlying positivity and choice-confirmation bias? Further research should elucidate whether those biases are generated by an absolute overweighting of positive/confirmatory feedback, an absolute underweighting of negative/disconfirmatory feedback, or simply a relative imbalance between the two.
Which cognitive processes contribute to, or affect, the choice-confirmation bias? Links with selective attention and selective memory might be of special interest, as both are suspected to have a key role in biasing high-order belief updating.
What could be the macroscopic consequences of these reinforcement biases? We anticipate that two directions might be particularly fruitful: the clinical direction, where the biases could have a role in several types of addiction and in pathological gambling; and the social sciences direction, where elementary RL biases could be connected with general social phenomena such as opinion polarization.
What could be the benefit of including these RL biases in artificial intelligent agents? Simulation studies show that the confirmation bias is statistically optimal in simple two-armed bandit tasks, but what about more complex learning problems? What about more ecological situations?

Acknowledgments
S.P. and M.L. thank Germain Lefebvre, Nahuel Salem-Garcia, Valerian Chambon, and Héloise Théro for stimulating discussions and for leading most of the experimental work that has nurtured these ideas over the past years. S.P. and M.L. thank Zoe Koopmans for proofreading the manuscript. S.P. and M.L. thank Alireza Soltani, Hiroyuki Ohta, Sonia Bishop, Christopher Gagne, and Germain Lefebvre for providing material for the figures. S.P. is supported by the Institut de Recherche en Santé Publique (IRESP, grant number: 20II138-00) and the Agence Nationale de la Recherche (CogFinAgent: ANR-21-CE23-000202; RELATIVE: ANR-21-CE37-0008-01; RANGE: ANR-21-CE28-0024-01). The Département d'Études Cognitives is supported by the Agence Nationale de la Recherche (ANR; FrontCog ANR-17-EURE-0017). M.L. is supported by a Swiss National Science Foundation (SNSF) Ambizione grant (PZ00P3_174127) and a European Research Council (ERC) Starting Grant (INFORL-948671).

Declaration of interests
No interests are declared.

References
1. Benjamin, D.J. (2019) Errors in probabilistic reasoning and judgment biases. In Handbook of Behavioral Economics: Applications and Foundations (Bernheim, B.D. et al., eds), pp. 69–186, North-Holland
2. Sharot, T. and Garrett, N. (2016) Forming beliefs: why valence matters. Trends Cogn. Sci. 20, 25–33
3. Eil, D. and Rao, J.M. (2011) The good news-bad news effect: asymmetric processing of objective information about yourself. Am. Econ. J. Microecon. 3, 114–138
4. Kuzmanovic, B. et al. (2018) Influence of vmPFC on dmPFC predicts valence-guided belief formation. J. Neurosci. 38, 7996–8010
5. Sharot, T. et al. (2011) How unrealistic optimism is maintained in the face of reality. Nat. Neurosci. 14, 1475–1479
6. Klayman, J. (1995) Varieties of confirmation bias. In The Psychology of Learning and Motivation (Busemeyer, J. et al., eds), pp. 385–418, Academic Press
7. Nickerson, R.S. (1998) Confirmation bias: a ubiquitous phenomenon in many guises. Rev. Gen. Psychol. 2, 175–220
8. Eskreis-Winkler, L. and Fishbach, A. (2019) Not learning from failure—the greatest failure of all. Psychol. Sci. 30, 1733–1744
9. Staats, B.R. et al. (2018) Maintaining beliefs in the face of negative news: the moderating role of experience. Manag. Sci. 64, 804–824
10. Coutts, A. (2019) Good news and bad news are still news: experimental evidence on belief updating. Exp. Econ. 22, 369–395
11. Tappin, B.M. et al. (2017) The heart trumps the head: desirability bias in political belief revision. J. Exp. Psychol. Gen. 146, 1143
12. Bénabou, R. and Tirole, J. (2016) Mindful economics: the production, consumption, and value of beliefs. J. Econ. Perspect. 30, 141–164
13. Loewenstein, G. and Molnar, A. (2018) The renaissance of belief-based utility in economics. Nat. Hum. Behav. 2, 166–167
14. Sharot, T. et al. (2021) Why and when beliefs change: a multiattribute value-based decision problem. PsyArXiv Published online November 4, 2021. https://doi.org/10.31234/osf.io/q75ej
15. Bénabou, R. and Tirole, J. (2002) Self-confidence and personal motivation. Q. J. Econ. 117, 871–915
16. Kuhnen, C.M. and Knutson, B. (2011) The influence of affect on beliefs, preferences, and financial decisions. J. Financ. Quant. Anal. 46, 605–626
17. Barron, K. (2021) Belief updating: does the 'good-news, bad-news' asymmetry extend to purely financial domains? Exp. Econ. 24, 31–58
18. Kuhnen, C.M. (2015) Asymmetric learning from financial information. J. Finan. 70, 2029–2062
19. Buser, T. et al. (2018) Responsiveness to feedback as a personal trait. J. Risk Uncertain. 56, 165–192
20. Sutton, R.S. and Barto, A.G. (1998) Reinforcement Learning: An Introduction, Cambridge University Press
21. Botvinick, M. et al. (2019) Reinforcement learning, fast and slow. Trends Cogn. Sci. 23, 408–422
22. Hassabis, D. et al. (2017) Neuroscience-inspired artificial intelligence. Neuron 95, 245–258
23. Aberg, K.C. et al. (2016) Linking individual learning styles to approach-avoidance motivational traits and computational aspects of reinforcement learning. PLoS One 11, e0166675
24. Chase, H.W. et al. (2010) Approach and avoidance learning in patients with major depression and healthy controls: relation to anhedonia. Psychol. Med. 40, 433–440
25. Frank, M.J. et al. (2007) Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proc. Natl. Acad. Sci. U. S. A. 104, 16311–16316
26. Kahnt, T. et al. (2009) Dorsal striatal–midbrain connectivity in humans predicts how reinforcements are used to guide decisions. J. Cogn. Neurosci. 21, 1332–1345
27. den Ouden, H.E.M. et al. (2013) Dissociable effects of dopamine and serotonin on reversal learning. Neuron 80, 1090–1100
28. Palminteri, S. et al. (2017) The importance of falsification in computational cognitive modeling. Trends Cogn. Sci. 21, 425–433
29. Lefebvre, G. et al. (2017) Behavioural and neural characterization of optimistic reinforcement learning. Nat. Hum. Behav. 1, 1–9
30. Ting, C.-C. et al. (2021) The elusive effects of incidental anxiety on reinforcement-learning. J. Exp. Psychol. Learn. Mem. Cogn. Published online September 13, 2021. https://doi.apa.org/doi/10.1037/xlm0001033
31. Behrens, T.E.J. et al. (2007) Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221
32. Farashahi, S. et al. (2019) Flexible combination of reward information across primates. Nat. Hum. Behav. 3, 1215–1224
33. Gagne, C. et al. (2020) Impaired adaptation of learning to contingency volatility in internalizing psychopathology. eLife 9, e61387
34. Garrett, N. and Daw, N.D. (2020) Biased belief updating and suboptimal choice in foraging decisions. Nat. Commun. 11, 3417
35. Steinke, A. et al. (2020) Parallel model-based and model-free reinforcement learning for card sorting performance. Sci. Rep. 10, 15464
36. Nioche, A. et al. (2019) Coordination over a unique medium of exchange under information scarcity. Palgrave Commun. 5, 1–11
37. Ciranka, S. et al. (2022) Asymmetric reinforcement learning facilitates human inference of transitive relations. Nat. Hum. Behav. 6, 555–564
38. Christakou, A. et al. (2013) Neural and psychological maturation of decision-making in adolescence and young adulthood. J. Cogn. Neurosci. 25, 1807–1823
39. Gershman, S.J. (2015) Do learning rates adapt to the distribution of rewards? Psychon. Bull. Rev. 22, 1320–1327
40. Niv, Y. et al. (2012) Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. J. Neurosci. 32, 551–562
41. Pulcu, E. and Browning, M. (2017) Affective bias as a rational response to the statistics of rewards and punishments. eLife 6, e27879
42. Wise, T. and Dolan, R.J. (2020) Associations between aversive learning processes and transdiagnostic psychiatric symptoms in a general population sample. Nat. Commun. 11, 4179
43. Wise, T. et al. (2019) A computational account of threat-related attentional bias. PLoS Comput. Biol. 15, e1007341
44. Hertwig, R. and Erev, I. (2009) The description–experience gap in risky choice. Trends Cogn. Sci. 13, 517–523
45. Chambon, V. et al. (2020) Information about action outcomes differentially affects learning from self-determined versus imposed choices. Nat. Hum. Behav. 4, 1067–1079
46. Palminteri, S. et al. (2017) Confirmation bias in human reinforcement learning: evidence from counterfactual feedback processing. PLoS Comput. Biol. 13, e1005684
47. Lebreton, M. et al. (2019) Contextual influence on confidence judgments in human reinforcement learning. PLoS Comput. Biol. 15, e1006973
48. Salem-Garcia, N.A. et al. (2021) The computational origins of confidence biases in reinforcement learning. PsyArXiv Published online July 6, 2021. https://doi.org/10.31234/osf.io/dpqj6
49. Schüller, T. et al. (2020) Decreased transfer of value to action in Tourette syndrome. Cortex 126, 39–48
50. Cockburn, J. et al. (2014) A reinforcement learning mechanism responsible for the valuation of free choice. Neuron 83, 551–557
51. Doll, B.B. et al. (2009) Instructional control of reinforcement learning: a behavioral and neurocomputational investigation. Brain Res. 1299, 74–94
52. Doll, B.B. et al. (2011) Dopaminergic genes predict individual differences in susceptibility to confirmation bias. J. Neurosci. 31, 6188–6198
53. Harris, C. et al. (2020) Unique features of stimulus-based probabilistic reversal learning. bioRxiv Published online September 25, 2022. https://doi.org/10.1101/2020.09.24.310771
54. Ohta, H. et al. (2021) The asymmetric learning rates of murine exploratory behavior in sparse reward environments. Neural Netw. 143, 218–229
55. Nussenbaum, K. et al. (2021) Flexibility in valenced reinforcement learning computations across development. PsyArXiv Published online November 16, 2021. https://doi.org/10.31234/osf.io/5f9uc
56. Chierchia, G. et al. (2021) Choice-confirmation bias in reinforcement learning changes with age during adolescence. PsyArXiv Published online October 6, 2021. https://doi.org/10.31234/osf.io/xvzwb
57. Habicht, J. et al. (2021) Children are full of optimism, but those rose-tinted glasses are fading—Reduced learning from negative outcomes drives hyperoptimism in children. J. Exp. Psychol. Gen. Published online December 30, 2021. https://doi.apa.org/doi/10.1037/xge0001138
58. Xia, L. et al. (2021) Modeling changes in probabilistic reinforcement learning during adolescence. PLoS Comput. Biol. 17, e1008524
59. Rosenbaum, G.M. et al. (2022) Valence biases in reinforcement learning shift across adolescence and modulate subsequent memory. eLife 11, e64620
60. Cazé, R.D. and van der Meer, M.A.A. (2013) Adaptive properties of differential learning rates for positive and negative outcomes. Biol. Cybern. 107, 711–719
61. Gigerenzer, G. and Selten, R. (2002) Bounded Rationality: The Adaptive Toolbox, MIT Press
62. Lefebvre, G. et al. (2022) A normative account of confirmation bias during reinforcement learning. Neural Comput. 34, 307–337
63. Kandroodi, M.R. et al. (2021) Optimal reinforcement learning with asymmetric updating in volatile environments: a simulation study. bioRxiv Published online February 16, 2021. https://doi.org/10.1101/2021.02.15.431283
64. Tarantola, T. et al. (2021) Confirmation bias optimizes reward learning. bioRxiv Published online March 11, 2021. https://doi.org/10.1101/2021.02.27.433214
65. Summerfield, C. and Tsetsos, K. (2020) Rationality and efficiency in human decision-making. In The Cognitive Neurosciences VII (Gazzaniga, M., ed.), pp. 427–438, MIT Press
66. Rollwage, M. and Fleming, S.M. (2021) Confirmation bias is adaptive when coupled with efficient metacognition. Philos. Trans. R. Soc. B Biol. Sci. 376, 20200131
67. Joo, H.R. et al. (2021) Rats use memory confidence to guide decisions. Curr. Biol. 31, 4571–4583.e4
68. Kepecs, A. and Mainen, Z.F. (2012) A computational framework for the study of confidence in humans and animals. Philos. Trans. R. Soc. B Biol. Sci. 367, 1322–1337
69. Sharot, T. et al. (2021) Why and when beliefs change: a multiattribute value-based decision problem. PsyArXiv Published online November 4, 2021. https://doi.org/10.31234/osf.io/q75ej
70. Kobayashi, T. (2021) Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization. ArXiv Published online May 27, 2021. https://doi.org/10.48550/arXiv.2105.12991
71. Palminteri, S. and Pessiglione, M. (2017) Opponent brain systems for reward and punishment learning: causal evidence from drug and lesion studies in humans. In Decision Neuroscience (Dreher, J.-C. and Tremblay, L., eds), pp. 291–303, Academic Press
72. Bayer, H.M. and Glimcher, P.W. (2005) Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron 47, 129–141
73. Dayan, P. (2012) Twenty-five lessons from computational neuromodulation. Neuron 76, 240–256
74. Di Chiara, G. (1999) Drug addiction as dopamine-dependent associative learning disorder. Eur. J. Pharmacol. 375, 13–30
75. Schultz, W. et al. (1997) A neural substrate of prediction and reward. Science 275, 1593–1599
76. Frank, M.J. (2006) Hold your horses: a dynamic computational role for the subthalamic nucleus in decision making. Neural Netw. 19, 1120–1136
77. Collins, A.G.E. and Frank, M.J. (2014) Opponent actor learning (OpAL): modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive. Psychol. Rev. 121, 337–366
78. van Swieten, M.M.H. and Bogacz, R. (2020) Modeling the effects of motivation on choice and learning in the basal ganglia. PLoS Comput. Biol. 16, e1007465
79. Soltani, A. et al. (2006) Neural mechanism for stochastic behaviour during a competitive game. Neural Netw. 19, 1075–1090
80. Farashahi, S. et al. (2017) Metaplasticity as a neural substrate for adaptive learning and choice under uncertainty. Neuron 94, 401–414.e6
81. Frank, M.J. et al. (2004) By carrot or by stick: cognitive reinforcement learning in Parkinsonism. Science 306, 1940–1943
82. McCoy, B. et al. (2019) Dopaminergic medication reduces striatal sensitivity to negative outcomes in Parkinson's disease. Brain 142, 3605–3620
83. Palminteri, S. et al. (2009) Pharmacological modulation of subliminal learning in Parkinson's and Tourette's syndromes. Proc. Natl. Acad. Sci. U. S. A. 106, 19179–19184
84. Pessiglione, M. et al. (2006) Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature 442, 1042–1045
85. Slooten, J.C.V. et al. (2018) How pupil responses track value-based decision-making during and after reinforcement learning. PLoS Comput. Biol. 14, e1006632
86. Li, J. and Daw, N.D. (2011) Signals in human striatum are appropriate for policy update rather than value prediction. J. Neurosci. 31, 5504–5511
87. Klein, T.A. et al. (2017) Learning relative values in the striatum induces violations of normative decision making. Nat. Commun. 8, 16033
88. Ruggeri, K. et al. (2020) Replicating patterns of prospect theory for decision under risk. Nat. Hum. Behav. 4, 622–633
89. Kahneman, D. and Tversky, A. (1979) Prospect theory: an analysis of decision under risk. Econometrica 47, 263
90. Garcia, B. et al. (2021) The description–experience gap: a challenge for the neuroeconomics of decision-making under uncertainty. Philos. Trans. R. Soc. B Biol. Sci. 376, 20190665
91. Kahneman, D. and Tversky, A. (2000) Choices, Values, and Frames, Cambridge University Press
92. Kahneman, D. et al. (1997) Back to Bentham? Explorations of experienced utility. Q. J. Econ. 112, 375–406
93. Yechiam, E. (2019) Acceptable losses: the debatable origins of loss aversion. Psychol. Res. 83, 1327–1339
94. Anderson, C.J. (2003) The psychology of doing nothing: forms of decision avoidance result from reason and emotion. Psychol. Bull. 129, 139–167
95. Sokol-Hessner, P. and Rutledge, R.B. (2019) The psychological and neural basis of loss aversion. Curr. Dir. Psychol. Sci. 28, 20–27
96. Jachimowicz, J.M. et al. (2019) When and why defaults influence decisions: a meta-analysis of default effects. Behav. Public Policy 3, 159–186
97. Kahneman, D. et al. (1991) Anomalies: the endowment effect, loss aversion, and status quo bias. J. Econ. Perspect. 5, 193–206
98. Fauth-Bühler, M. et al. (2017) Pathological gambling: a review of the neurobiological evidence relevant for its classification as an addictive disorder. Addict. Biol. 22, 885–897
99. Clark, L. et al. (2019) Neuroimaging of reward mechanisms in gambling disorder: an integrative review. Mol. Psychiatry 24, 674–693
100. Wilson, R.C. and Collins, A.G. (2019) Ten simple rules for the computational modeling of behavioral data. eLife 8, e49547
101. Agrawal, V. and Shenoy, P. (2021) Tracking what matters: a decision-variable account of human behavior in bandit tasks. Proceedings of the 43rd Annual Meeting of the Cognitive Science Society, virtual meeting
102. Harada, T. (2020) Learning from success or failure? – Positivity biases revisited. Front. Psychol. 11, 1627
103. Palminteri, S. (2021) Choice-confirmation bias and gradual perseveration in human reinforcement learning. PsyArXiv Published online July 6, 2021. https://doi.org/10.31234/osf.io/dpqj6
104. Sugawara, M. and Katahira, K. (2021) Dissociation between asymmetric value updating and perseverance in human reinforcement learning. Sci. Rep. 11, 3574
105. Tano, P. et al. (2017) Variability in prior expectations explains biases in confidence reports. bioRxiv Published online April 13, 2017. https://doi.org/10.1101/127399
106. Zhou, C.Y. et al. (2020) Devaluation of unchosen options: a Bayesian account of the provenance and maintenance of overly optimistic expectations. CogSci. 42, 1682–1688
107. Rajsic, J. et al. (2015) Confirmation bias in visual search. J. Exp. Psychol. Hum. Percept. Perform. 41, 1353–1364
108. Rollwage, M. et al. (2020) Confidence drives a neural confirmation bias. Nat. Commun. 11, 2634
109. Talluri, B.C. et al. (2018) Confirmation bias through selective overweighting of choice-consistent evidence. Curr. Biol. 28, 3128–3135.e8
110. Talluri, B.C. et al. (2021) Choices change the temporal weighting of decision evidence. J. Neurophysiol. 125, 1468–1481
111. Bavard, S. et al. (2021) Two sides of the same coin: beneficial and detrimental consequences of range adaptation in human reinforcement learning. Sci. Adv. 7, eabe0340
112. Katahira, K. (2018) The statistical structures of reinforcement learning with asymmetric value updates. J. Math. Psychol. 87, 31–45
113. Madan, C.R. et al. (2019) Comparative inspiration: from puzzles with pigeons to novel discoveries with humans in risky choice. Behav. Process. 160, 10–19
114. Eckstein, M.K. et al. (2021) What do reinforcement learning models measure? Interpreting model parameters in cognition and neuroscience. Curr. Opin. Behav. Sci. 41, 128–137
115. Miller, K.J. et al. (2019) Habits without values. Psychol. Rev. 126, 292
116. Correa, C.M.C. et al. (2018) How the level of reward awareness changes the computational and electrophysiological signatures of reinforcement learning. J. Neurosci. 38, 10338–10348
117. Gueguen, M.C.M. et al. (2021) Anatomical dissociation of intracerebral signals for reward and punishment prediction errors in humans. Nat. Commun. 12, 3344
118. Voon, V. et al. (2015) Disorders of compulsivity: a common bias towards learning habits. Mol. Psychiatry 20, 345–352