Recent evidence indicates that reward value encoding in humans is highly context dependent, leading to suboptimal decisions in some cases, but whether this computational constraint on valuation is a shared feature of human cognition remains unknown. Here we studied the behaviour of n = 561 individuals from 11 countries of markedly different socioeconomic and cultural makeup. Our findings show that context sensitivity was present in all 11 countries. Suboptimal decisions generated by context manipulation were not explained by risk aversion, as estimated through a separate description-based choice task (that is, lotteries) consisting of matched decision offers. Conversely, risk aversion significantly differed across countries. Overall, our findings suggest that context-dependent reward value encoding is a feature of human cognition that remains consistently present across different countries, as opposed to description-based decision-making, which is more permeable to cultural factors.
In the present study, we investigate and compare reasoning in large language models (LLMs) and humans, using a selection of cognitive psychology tools traditionally dedicated to the study of (bounded) rationality. We presented new variants of classical cognitive experiments to human participants and to an array of pretrained LLMs, and cross-compared their performances. Our results showed that most of the included models presented reasoning errors akin to those frequently ascribed to error-prone, heuristic-based human reasoning. Notwithstanding this superficial similarity, an in-depth comparison between humans and LLMs indicated important departures from human-like reasoning, with the models' limitations disappearing almost entirely in more recent LLM releases. Moreover, we show that while it is possible to devise strategies that induce better performance, humans and machines are not equally responsive to the same prompting schemes. We conclude by discussing the epistemological implications and challenges of comparing human and machine behavior for both artificial intelligence and cognitive psychology.
Do we preferentially learn from outcomes that confirm our choices? In recent years, we investigated this question in a series of studies implementing increasingly complex behavioral protocols. The learning rates fitted in experiments featuring partial or complete feedback, as well as free and forced choices, were systematically found to be consistent with a choice-confirmation bias. One of the prominent behavioral consequences of the confirmatory learning rate pattern is choice hysteresis: that is, the tendency to repeat previous choices despite contradictory evidence. However, a choice-confirmatory pattern of learning rates may spuriously arise when an explicit (gradual) choice perseveration term is omitted from the model. In the present study, we reanalyze data from four published papers (nine experiments; 363 subjects; 126,192 trials), originally included in studies demonstrating or criticizing the choice-confirmation bias in human participants. We fitted two models: one featured valence-specific updates (i.e., different learning rates for confirmatory and disconfirmatory outcomes) and one additionally included gradual perseveration. Our analysis confirms that including the gradual perseveration process in the model significantly reduces the estimated choice-confirmation bias. However, in all considered experiments, the choice-confirmation bias remains present at the meta-analytical level, and significantly different from zero in most experiments. Our results demonstrate that the choice-confirmation bias resists the inclusion of a gradual perseveration term, thus proving to be a robust feature of human reinforcement learning. We conclude by pointing to additional computational processes that may play an important role in estimating and interpreting the computational biases under scrutiny.
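To make the two competing model components concrete, here is a minimal simulation sketch of a two-armed bandit learner that combines valence-specific (confirmatory/disconfirmatory) learning rates with a gradual perseveration trace. The update rules and all parameter values are illustrative assumptions for exposition, not the fitted models from the reanalyzed datasets.

```python
import numpy as np

def simulate_agent(rewards, alpha_conf=0.30, alpha_disc=0.10,
                   persev_rate=0.20, persev_weight=1.0, beta=5.0):
    """Two-armed bandit agent with confirmatory/disconfirmatory learning
    rates plus a gradual perseveration trace (illustrative parameters)."""
    n_trials, n_options = rewards.shape
    q = np.zeros(n_options)        # option values
    c = np.zeros(n_options)        # gradual perseveration trace
    choices = np.zeros(n_trials, dtype=int)
    rng = np.random.default_rng(0)

    for t in range(n_trials):
        # softmax over value plus a perseveration bonus
        logits = beta * q + persev_weight * c
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(n_options, p=p)
        choices[t] = a

        # prediction error for the chosen option
        delta = rewards[t, a] - q[a]
        # confirmatory outcomes (positive PE on the chosen option)
        # are integrated faster than disconfirmatory ones
        alpha = alpha_conf if delta > 0 else alpha_disc
        q[a] += alpha * delta

        # the perseveration trace decays toward the last choice
        target = np.eye(n_options)[a]
        c += persev_rate * (target - c)

    return choices

# Example: 100 trials, option 1 rewarded 75% of the time, option 0 25%
rewards = (np.random.default_rng(1).random((100, 2)) <
           np.array([0.25, 0.75])).astype(float)
choices = simulate_agent(rewards)
```

Fitting a model with only the asymmetric learning rates to choices generated this way can inflate the apparent confirmation bias, which is the confound the reanalysis above addresses by fitting both terms jointly.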
Understanding how learning changes during human development has been one of the long-standing objectives of developmental science. Recently, advances in computational biology have demonstrated that humans display a bias when learning to navigate novel environments through rewards and punishments: they learn more from out
Background. Value-based decision-making impairment in depression is a complex phenomenon: while some studies found evidence of blunted reward learning and blunted reward-related signals in the brain, others indicate no effect. Here we test whether such reward sensitivity deficits depend on the overall value of the decision problem. Methods. We used a two-armed bandit task with two different contexts: one 'rich' and one 'poor', in which both options were associated with an overall positive or negative expected value, respectively. We tested patients (N = 30) undergoing a major depressive episode and controls matched for age, gender and socioeconomic status (N = 26). Learning performance, followed by a transfer phase without feedback, was analyzed to disentangle between a decision mechanism and a value-update mechanism. Finally, we used computational model simulation and fitting to link behavioral patterns to learning biases. Results. Control subjects showed similar learning performance in the 'rich' and the 'poor' contexts, whereas patients displayed reduced learning in the 'poor' context. Analysis of the transfer phase showed that the context-dependent impairment in patients generalized, suggesting that the effect of depression is to be traced to outcome encoding. Computational model-based results showed that patients displayed a higher learning rate for negative than for positive outcomes (the opposite was true in controls). Conclusions. Our results illustrate that reinforcement learning performance in depression depends on the value of the context. We show that depressed patients have a specific difficulty in contexts with an overall negative state value, which in our task is consistent with a negativity bias at the level of the learning rates.
Humans do not integrate new information objectively: outcomes carrying a positive affective value and evidence confirming one's own prior belief are overweighted. Until recently, theoretical and empirical accounts of the positivity and confirmation biases assumed them to be specific to 'high-level' belief updates. We present evidence against this account. Learning rates in reinforcement learning (RL) tasks, estimated across different contexts and species, generally present the same characteristic asymmetry, suggesting that belief and value updating processes share key computational principles and distortions. This bias generates over-optimistic expectations about the probability of making the right choices and, consequently, generates over-optimistic reward expectations. We discuss the normative and neurobiological roots of these RL biases and their position within the greater picture of behavioral decision-making theories.
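The claim that asymmetric learning rates produce over-optimistic reward expectations can be illustrated with a few lines of simulation. The sketch below (parameter values are illustrative assumptions) runs a Rescorla-Wagner learner with a higher learning rate for positive than for negative prediction errors on a 50% reward source; its value estimate settles well above the true expected reward.

```python
import numpy as np

def asymmetric_value_estimate(p_reward=0.5, alpha_pos=0.3, alpha_neg=0.1,
                              n_trials=10_000, seed=0):
    """Rescorla-Wagner update with valence-dependent learning rates.
    Returns the long-run value estimate of a binary (0/1) reward source."""
    rng = np.random.default_rng(seed)
    v = 0.0
    for _ in range(n_trials):
        r = float(rng.random() < p_reward)
        delta = r - v
        v += (alpha_pos if delta > 0 else alpha_neg) * delta
    return v

# True expected reward is 0.5; the biased learner settles near
# alpha_pos * p / (alpha_pos * p + alpha_neg * (1 - p)) = 0.75
print(asymmetric_value_estimate())
```

The closed-form equilibrium in the comment follows from balancing the expected upward and downward updates, which is one simple way to see why the asymmetry inflates expectations rather than averaging out.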
A wealth of evidence in perceptual and economic decision-making research suggests that the subjective assessment of one option is influenced by the context. A series of studies provides evidence that the same coding principles apply to situations where decisions are shaped by past outcomes, that is, in reinforcement-learning situations. In bandit tasks, human behavior is explained by models assuming that individuals do not learn the objective value of an outcome, but rather its subjective, context-dependent representation. We argue that, while such outcome context-dependence may be informationally or ecologically optimal, it concomitantly undermines the capacity to generalize value-based knowledge to new contexts, sometimes creating apparent decision paradoxes.
Evidence suggests that economic values are rescaled as a function of the range of the available options. Although locally adaptive, range adaptation has been shown to lead to suboptimal choices, particularly notable in reinforcement learning (RL) situations when options are extrapolated from their original context to a new one. Range adaptation can be seen as the result of an adaptive coding process aiming at increasing the signal-to-noise ratio. However, this hypothesis leads to a counterintuitive prediction: decreasing task difficulty should increase range adaptation and, consequently, extrapolation errors. Here, we tested this paradoxical relation between range adaptation and performance in a large sample of participants performing variants of an RL task in which we manipulated task difficulty. Results confirmed that range adaptation induces systematic extrapolation errors and is stronger when task difficulty is decreased. Finally, we propose a range-adapting model and show that it parsimoniously captures all the behavioral results.
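For intuition about how range adaptation leads to extrapolation errors, here is a minimal sketch of a range-adapted value update under assumed functional forms (the normalization rule and parameters are illustrative, not the authors' fitted model). Options learned in contexts with very different reward ranges end up with nearly identical subjective values, so comparing them in a new, common context can favor the objectively worse option.

```python
def range_adapted_update(v, outcome, r_min, r_max, alpha=0.2, omega=1.0):
    """One learning step in which the outcome is rescaled by the context range.
    omega interpolates between absolute coding (0) and full range adaptation (1).
    Functional form and parameter values are illustrative assumptions."""
    span = max(r_max - r_min, 1e-6)                    # avoid division by zero
    relative = (outcome - r_min) / span                # outcome mapped to [0, 1]
    subjective = (1 - omega) * outcome + omega * relative
    return v + alpha * (subjective - v)

# The best option of a large-magnitude context and the best option of a
# small-magnitude context converge to the same subjective value (~1.0),
# even though their objective payoffs differ by a factor of ten.
v_big, v_small = 0.0, 0.0
for _ in range(200):
    v_big = range_adapted_update(v_big, outcome=10.0, r_min=0.0, r_max=10.0)
    v_small = range_adapted_update(v_small, outcome=1.0, r_min=0.0, r_max=1.0)
print(round(v_big, 2), round(v_small, 2))
```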
While there is no doubt that social signals affect human reinforcement learning, there is still no consensus about how this process is computationally implemented. To address this issue, we compared three psychologically plausible hypotheses about the algorithmic implementation of imitation in reinforcement learning. The first hypothesis, decision biasing (DB), postulates that imitation consists in transiently biasing the learner's action selection without affecting their value function. According to the second hypothesis, model-based imitation (MB), the learner infers the demonstrator's value function through inverse reinforcement learning and uses it to bias action selection. Finally, according to the third hypothesis, value shaping (VS), the demonstrator's actions directly affect the learner's value function. We tested these three hypotheses in two experiments (N = 24 and N = 44) featuring a new variant of a social reinforcement learning task. We show through model comparison and model simulation that VS provides the best explanation of learners' behavior. These results were replicated in a third, independent experiment featuring a larger cohort and a different design (N = 302). In our experiments, we also manipulated the quality of the demonstrators' choices and found that learners were able to adapt their imitation rate, so that only skilled demonstrators were imitated. We proposed and tested an efficient meta-learning process to account for this effect, in which imitation is regulated by the agreement between the learner and the demonstrator. In sum, our findings provide new insights and perspectives on the computational mechanisms underlying adaptive imitation in human reinforcement learning.
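A compact sketch of the contrast between decision biasing and value shaping may help fix ideas; the parameterization and update rules below are illustrative assumptions, not the fitted models from the study.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def decision_biasing_policy(q, demo_action, beta=5.0, bias=2.0):
    """DB: the demonstrator's choice adds a transient bonus at the
    action-selection stage; the learner's value function q is untouched."""
    bonus = np.zeros_like(q)
    bonus[demo_action] = bias
    return softmax(beta * q + bonus)

def value_shaping_update(q, demo_action, kappa=0.3):
    """VS: the demonstrator's choice directly nudges the learner's values,
    so its influence persists over subsequent trials."""
    q = q.copy()
    q[demo_action] += kappa * (1.0 - q[demo_action])
    return q

# With DB the influence vanishes once the demonstration ends, whereas with
# VS the shaped value keeps biasing later choices.
q = np.array([0.2, 0.2])
print(decision_biasing_policy(q, demo_action=1))   # skewed policy, q unchanged
print(value_shaping_update(q, demo_action=1))      # q itself is changed
```

Model-based imitation (MB) would sit between the two: it maintains a separate estimate of the demonstrator's values, inferred from their choices, and mixes it into action selection.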
Keywords: language; Huntington's disease; basal ganglia; brain mapping. Though accumulating evidence indicates that the striatum is recruited during language processing, the specific function of this subcortical structure in language remains to be elucidated. To answer this question, we used Huntington's disease as a model of striatal lesion. We investigated the morphological deficit of 30 early-stage Huntington's disease (HD) patients with a novel linguistic task that can be modeled within an explicit theory of linguistic computation. Behavioral results reflected an impairment of HD patients on the linguistic task. Computational model-based analysis compared the behavioral data to simulated data from two distinct lesion models: a selection deficit model and a grammatical deficit model. This analysis revealed that the impairment derives from an increased randomness in the process of selecting between grammatical alternatives, rather than from a disruption of grammatical knowledge per se. Voxel-based morphometry allowed us to correlate this impairment with dorsal striatal degeneration. We thus show that the striatum holds a role in the selection of linguistic alternatives, just as in the selection of motor and cognitive programs.
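One schematic way to express the two lesion models contrasted above is to treat a selection deficit as increased choice randomness over intact option scores, and a grammatical deficit as degradation of the scores themselves. Representing "grammatical knowledge" as numeric option scores is an illustrative assumption for this sketch, not the paper's actual model.

```python
import numpy as np

def choice_probabilities(scores, temperature=1.0, knowledge_noise=0.0, seed=0):
    """Probability of selecting each grammatical alternative.
    selection deficit   -> higher temperature (noisier selection, intact scores)
    grammatical deficit -> noise added to the knowledge (the scores) themselves."""
    rng = np.random.default_rng(seed)
    corrupted = scores + knowledge_noise * rng.standard_normal(len(scores))
    z = corrupted / max(temperature, 1e-6)
    p = np.exp(z - z.max())
    return p / p.sum()

scores = np.array([2.0, 0.0])          # the first alternative is grammatical
print(choice_probabilities(scores))                      # intact performance
print(choice_probabilities(scores, temperature=4.0))     # selection deficit
print(choice_probabilities(scores, knowledge_noise=2.0)) # grammatical deficit
```

Both deficits lower accuracy, but they predict different error patterns across items, which is the kind of signature a model-based comparison of simulated and observed data can exploit.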
Investigating the bases of inter-individual differences in risk-taking is necessary to refine our cognitive and neural models of decision-making and, ultimately, to counter risky behaviors in real-life policy settings. However, recent evidence suggests that behavioral tasks fare poorly compared with standard questionnaires in measuring individual differences in risk-taking. Crucially, using model-based measures of risk-taking does not seem to improve reliability. Here, we put forward two possible (not mutually exclusive) explanations for these results and suggest future avenues of research to improve the assessment of inter-individual differences in risk-taking by combining repeated online testing and mechanistic computational models.
Money is a fundamental and ubiquitous institution in modern economies. However, the question of its emergence remains a central one for economists. The monetary search-theoretic approach studies the conditions under which commodity money emerges as a solution to override frictions inherent to interindividual exchanges in a decentralized economy. Although, among these conditions, agents' rationality is classically essential and a prerequisite to any theoretical monetary equilibrium, human subjects often fail to adopt optimal strategies in tasks implementing a search-theoretic paradigm when these strategies are speculative, i.e., when they involve the use of a costly medium of exchange to increase the probability of subsequent and successful trades. In the present work, we hypothesize that implementing such speculative behaviors relies on reinforcement learning instead of lifetime utility calculations, as supposed by classical economic theory. To test this hypothesis, we operationalized the Kiyotaki and Wright paradigm of money emergence in a multistep exchange task and fitted behavioral data from human subjects performing this task with two reinforcement learning models. Each model implements a distinct cognitive hypothesis regarding the weight of future or counterfactual rewards in current decisions. We found that both models outperformed theoretical predictions about subjects' behavior regarding the implementation of speculative strategies, and that the latter relies on the degree to which opportunity costs are considered in the learning process. Speculating about the marketability advantage of money thus seems to depend on mental simulations of counterfactual events that agents perform in exchange situations. Keywords: search-theoretic model | reinforcement learning | speculative behavior | opportunity cost. Money is both a very complex social phenomenon and easy to manipulate in everyday basic transactions. It is an institutional solution to common frictions in an exchange economy, such as the absence of double coincidence of wants between traders (1). It is of widespread use despite being dominated in terms of rate of return by all other assets (2). However, it can be speculatively used in a fundamental sense: its economically dominated holding can be justified by the anticipation of future trading opportunities that are not available at the present moment but will necessitate this particular holding. In this study, we concentrate on a paradigm of commodity-money emergence in which one of the goods exchanged in the economy becomes the selected medium of exchange despite its storage being costlier than that of any other good. This is typical monetary speculation, in contrast to other types of speculation, which consist in expecting an increased market price for a good in the future. The price of money does not vary: only the opportunity that it can afford in the future does. This seems to us to be an important feature of speculative economic behavior relative to the otherwise apparently irrational holding of such a good. We study whether individuals endowed with some information about future exchange opportunities will tend to consider a financially dominated good as a medium of exchange.
Modern behaviorally founded theories of the emergence of money and monetary equilibrium (3, 4) are jointly based on the idea of minimizing a trading search process and on individual choices of accepting, declining, or postponing immediate exchanges at different incurred costs. We focus on an influential paradigm by Kiyotaki and Wright (4) (KW hereafter) in which the individual choice of accepting temporarily costly exchanges, due to the anticipation of later better trading opportunities, is precisely stylized as a speculative behavior and yields a corresponding monetary equilibrium. The environment of this paradigm consists of N agents specialized in terms of both consumption and production in such a manner that there is initially no double coincidence of wants. Frictions in the exchange process create a necessity for at least some of the agents to trade for goods that they neither produce nor consume, which are then used as media of exchange. The ultimate goal of agents, that is, to consume, may then require multiple steps to be achieved. The most interesting part is that, in some configurations, the optimal medium of exchange (i.e., the good that maximizes expected utility because of its relatively…). Significance: In the present study, we applied reinforcement learning models that are not classically used in experimental economics to a multistep exchange task of money emergence derived from a classic search-theoretic paradigm. This method allowed us to highlight the importance of counterfactual feedback processing of opportunity costs in the learning process underlying the speculative use of money, and the predictive power of reinforcement learning models for multistep economic tasks. These results constitute a step toward understanding the learning processes at work in multistep economic decision-making and the cognitive microfoundations of the use of money.
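The two abstracts above contrast learning from obtained rewards alone with learning that also weighs counterfactual, opportunity-cost information. A minimal sketch of that contrast is given below; the state/action structure, the update rule, and the parameter values are made-up assumptions for illustration and do not reproduce the KW task's actual exchange rules.

```python
import numpy as np

def update_with_opportunity_cost(q, action, reward, forgone_rewards,
                                 alpha=0.2, omega=0.5):
    """Q-value update in which the obtained outcome is penalized by the best
    forgone alternative (the opportunity cost), weighted by omega.
    omega = 0 recovers a purely factual learner; all values are illustrative."""
    q = q.copy()
    opportunity_cost = max(forgone_rewards) if forgone_rewards else 0.0
    effective_outcome = reward - omega * opportunity_cost
    q[action] += alpha * (effective_outcome - q[action])
    return q

# Example: accepting a costly-to-store good (action 1) yields a small immediate
# payoff but a large forgone alternative; how much that forgone payoff is
# weighted determines how the speculative action is valued over trials.
q = np.zeros(2)
q = update_with_opportunity_cost(q, action=1, reward=0.2, forgone_rewards=[1.0])
print(q)
```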
The extent to which subjective awareness influences reward processing, and thereby affects future decisions, is currently largely unknown. In the present report, we investigated this question in a reinforcement learning framework, combining perceptual masking, computational modeling, and electroencephalographic recordings (human male and female participants). Our results indicate that degrading the visibility of the reward decreased, without completely obliterating, the ability of participants to learn from outcomes, but concurrently increased their tendency to repeat previous choices. We dissociated electrophysiological signatures evoked by the reward-based learning processes from those elicited by the reward-independent repetition of previous choices and showed that these neural activities were significantly modulated by reward visibility. Overall, this report sheds new light on the neural computations underlying reward-based learning and decision-making and highlights that awareness is beneficial for the trial-by-trial adjustment of decision-making strategies.
In economics and perceptual decision-making, contextual effects are well documented: decision weights are adjusted as a function of the distribution of stimuli. Yet, in the reinforcement-learning literature, whether and how contextual information pertaining to decision states is integrated in learning algorithms has received comparably little attention. Here, we investigate reinforcement learning behavior and its computational substrates in a task where we orthogonally manipulate outcome valence and magnitude, resulting in systematic variations in state values. Model comparison indicates that subjects' behavior is best accounted for by an algorithm which includes both reference-point dependence and range adaptation, two crucial features of state-dependent valuation. In addition, we find that state-dependent outcome valuation progressively emerges, is favored by increasing outcome information, and is correlated with explicit understanding of the task structure. Finally, our data clearly show that, while being locally adaptive (for instance in negative-valence and small-magnitude contexts), state-dependent valuation comes at the cost of seemingly irrational choices when options are extrapolated out of their original contexts.
Adaptive coding of stimuli is well documented in perception, where it supports efficient encoding over a broad range of possible percepts. Recently, a similar neural mechanism has also been reported in value-based decision-making, where it allows optimal encoding of vast ranges of values in PFC: the neuronal response to value depends on the choice context (relative coding), rather than being invariant across contexts (absolute coding). Additionally, value learning is sensitive to the amount of feedback information: providing complete feedback (both obtained and forgone outcomes) instead of partial feedback (only the obtained outcome) improves learning. However, it is unclear whether relative coding occurs in all PFC regions and how it is affected by feedback information. We systematically investigated univariate and multivariate feedback encoding in various mPFC regions and compared three modes of neural coding: absolute, partially adaptive and fully adaptive. Twenty-eight human participants (both sexes) performed a learning task while undergoing fMRI scanning. On each trial, they chose between two symbols associated with a certain outcome. Then, the decision outcome was revealed. Notably, in one half of the trials participants received partial feedback, whereas in the other half they got complete feedback. We used univariate and multivariate analysis to explore value encoding in the different feedback conditions. We found that both obtained and forgone outcomes were encoded in mPFC, but with opposite sign in its ventral and dorsal subdivisions. Moreover, we showed that increasing feedback information induced a switch from absolute to relative coding. Our results suggest that complete feedback information enhances context-dependent outcome encoding. This study offers a systematic investigation of the effect of the amount of feedback information (partial vs complete) on univariate and multivariate outcome value encoding, within multiple regions in mPFC and cingulate cortex that are critical for value-based decisions and behavioral adaptation. Moreover, we provide the first comparison of three possible models of neural coding (i.e., absolute, partially adaptive, and fully adaptive coding) of the value signal in these regions, using commensurable measures of prediction accuracy. Taken together, our results help build a more comprehensive picture of how the human brain encodes and processes outcome value. In particular, our results suggest that simultaneous presentation of obtained and forgone outcomes promotes relative value representation.
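The three coding schemes compared in that study can be written as simple transforms of an outcome given the attainable range in the current context. The exact parameterization below is an illustrative assumption, not the regressors used in the fMRI analysis.

```python
def coded_value(outcome, context_min, context_max, mode="fully_adaptive"):
    """Toy versions of three outcome-coding schemes.
    absolute:           value tracks the raw outcome, whatever the context.
    partially_adaptive: value is re-centered on the context midpoint but keeps
                        its original scale.
    fully_adaptive:     value is rescaled to the context range (relative coding)."""
    midpoint = (context_max + context_min) / 2.0
    span = max(context_max - context_min, 1e-6)
    if mode == "absolute":
        return outcome
    if mode == "partially_adaptive":
        return outcome - midpoint
    if mode == "fully_adaptive":
        return (outcome - midpoint) / span
    raise ValueError(f"unknown mode: {mode}")

# The best outcome of a small-range context (max 1) and of a large-range
# context (max 10) receive identical fully adaptive codes but very different
# absolute codes.
for mode in ("absolute", "partially_adaptive", "fully_adaptive"):
    print(mode, coded_value(1.0, 0.0, 1.0, mode), coded_value(10.0, 0.0, 10.0, mode))
```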
In simple instrumental-learning tasks, humans learn to seek gains and to avoid losses equally well. Yet, two effects of valence are observed. First, decisions in loss contexts are slower. Second, loss contexts decrease individuals' confidence in their choices. Whether these two effects are two manifestations of a single mechanism or whether they can be partially dissociated is unknown. Across six experiments, we attempted to disrupt the valence-induced motor bias effects by manipulating the mapping between decisions and actions and by imposing constraints on response times (RTs). Our goal was to assess the presence of the valence-induced confidence bias in the absence of the RT bias. We observed both motor and confidence biases despite our disruption attempts, establishing that the effects of valence on motor and metacognitive responses are very robust and replicable. Nonetheless, within- and between-individual inferences reveal that the confidence bias resists the disruption of the RT bias. Therefore, although concomitant in most cases, valence-induced motor and confidence biases seem to be partly dissociable. These results highlight important new mechanistic constraints that should be incorporated into learning models to jointly explain choices, reaction times and confidence.
Depending on environmental demands, humans can learn and exploit multiple concurrent sets of stimulus-response associations. Mechanisms underlying the learning of such task-sets remain unknown. Here we investigate the hypothesis that task-set learning relies on unsupervised chunking of stimulus-response associations that occur in temporal proximity. We examine behavioral and neural data from a task-set learning experiment using a network model. We first show that task-set learning can be achieved provided the timescale of chunking is slower than the timescale of stimulus-response learning. Fitting the model to behavioral data on a subject-by-subject basis confirmed this expectation and led to specific predictions linking chunking and task-set retrieval that were borne out by behavioral performance and reaction times. Comparing the model activity with BOLD signal allowed us to identify neural correlates of task-set retrieval in a functional network involving ventral and dorsal prefrontal cortex, with the dorsal system preferentially engaged when retrievals are used to improve performance.
Determining whether similar valence-induced biases exist in reinforcement learning and probabilistic reasoning may be crucial to help refine our understanding of adaptive and maladaptive decision-making through the lens of a unified computational approach. Standard reinforcement learning models conceive agents as impartial learners: they learn equally well from positive and negative outcomes alike [1]. However, empirical studies have recently come to challenge this view by demonstrating that human learners, rather than processing information impartially, consistently display a valence-induced bias: when faced with uncertain choice options, they tend to disregard bad news by integrating worse-than-expected outcomes (negative prediction errors) at a lower rate relative to better-than-expected ones (positive prediction errors) [2-4]. This positivity bias would echo the asymmetric processing of self-relevant information in probabilistic reasoning, whereby good news on average receives more weight than bad news [5,6]. A bias for learning preferentially from better-than-expected outcomes would reflect a preference for positive events in general. However, this prediction is at odds with recent findings. In a two-armed bandit task featuring complete feedback information, we previously found that participants would learn preferentially from better-than-expected obtained outcomes while preferentially learning from worse-than-expected forgone outcomes (that is, from the outcome associated with the option they had not chosen [7]). This learning asymmetry suggests that what has previously been characterized as a positivity bias may, in fact, be the upshot of a more general, and perhaps ubiquitous, choice-confirmation bias, whereby human agents preferentially integrate information that confirms their previous decision [8]. Building on these previous findings, we reasoned that if human reinforcement learning is indeed biased in a choice-confirmatory manner, learning from action-outcome couplings that were not voluntarily chosen by the subject (forced choices) should present no bias. To test this hypothesis, we conducted three experiments involving instrumental learning and computational model-based analyses. Participants were administered new variants of a probabilistic learning task in which they could freely choose between two options, or were 'forced' to implement the choice made by a computer. In the first experiment, participants were only shown the obtained outcome corresponding to their choice (factual learning). In the second experiment, participants were shown both the obtained and the forgone outcome (counterfactual learning). Finally, to address a concern raised during the review process, a third experiment was included in which both free- and forced-choice trials featured a condition with a random reward schedule (50/50). The rationale for implementing this reward schedule was to test whether or not the confirmation bias was due to potential sampling differences between types of trials. Indeed, in the free-choice condition, the most rewarding symbol should be increasingly selected as the subject learns the structure of the task. Having a random reward schedule eliminates the possibility of such unbalanced sampling between free- and forced-choice conditions. We had two key predictions.
With regard to factual learning, participants should learn better from positive prediction errors, but they should only do so when free to choose (free-choice trials), while showing no such effect when forced to match a computer's choice (forced-choice trials). With regard to counterfactual learning from forgone outcomes, we expected the opposite pattern: in free-choice trials, negative prediction errors should be more likely to be taken into account. The valence of new information influences learning rates in humans: good news tends to receive more weight than bad news. We investigated this learning bias in four experiments, by systematically manipulating the source of the required action (free versus forced choices), outcome contingencies (low versus high reward) and motor requirements (go versus no-go choices). Analysis of model-estimated learning rates showed that the confirmation bias in learning rates was specific to free choices, but was independent of outcome contingencies. The bias was also unaffected by the motor requirements, thus suggesting that it operates in the representational space of decisions, rather than of motoric actions. Finally, model simulations revealed that learning rates estimated from the choice-confirmation model had the effect of maximizing performance across low- and high-reward environments. We therefore suggest that the choice-confirmation bias may be adaptive for efficient learning of action-outcome contingencies, above and beyond fostering person-level dispositions such as self-esteem.
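A compact sketch of the free- versus forced-choice asymmetry described above is given below; the learning rule and parameter values are illustrative assumptions rather than the fitted model from the paper.

```python
import numpy as np

def confirmation_bias_update(q, action, r_obtained, r_forgone, free_choice,
                             alpha_conf=0.35, alpha_disc=0.15, alpha_neutral=0.25):
    """One trial of a two-option learner with a choice-confirmation bias.
    For free choices, confirmatory prediction errors (positive on the chosen
    option, negative on the forgone option) are integrated faster than
    disconfirmatory ones; for forced choices a single unbiased rate applies.
    Parameter values and the exact rule are illustrative assumptions."""
    q = q.copy()
    other = 1 - action

    delta_obtained = r_obtained - q[action]
    delta_forgone = r_forgone - q[other]

    if free_choice:
        a_obt = alpha_conf if delta_obtained > 0 else alpha_disc
        a_for = alpha_conf if delta_forgone < 0 else alpha_disc
    else:
        a_obt = a_for = alpha_neutral

    q[action] += a_obt * delta_obtained
    q[other] += a_for * delta_forgone
    return q

# Example trial with +1/-1 outcomes: the learner freely chose option 0,
# obtained +1 and saw a forgone -1, so both updates are confirmatory and fast.
q = confirmation_bias_update(np.zeros(2), action=0, r_obtained=1.0,
                             r_forgone=-1.0, free_choice=True)
print(q)
```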
Recent evidence indicates that reward value encoding in humans is highly context dependent, leadi... more Recent evidence indicates that reward value encoding in humans is highly context dependent, leading to suboptimal decisions in some cases, but whether this computational constraint on valuation is a shared feature of human cognition remains unknown. Here we studied the behaviour of n = 561 individuals from 11 countries of markedly different socioeconomic and cultural makeup. Our findings show that context sensitivity was present in all 11 countries. Suboptimal decisions generated by context manipulation were not explained by risk aversion, as estimated through a separate description-based choice task (that is, lotteries) consisting of matched decision offers. Conversely, risk aversion significantly differed across countries. Overall, our findings suggest that context-dependent reward value encoding is a feature of human cognition that remains consistently present across different countries, as opposed to description-based decision-making, which is more permeable to cultural factors.
In the present study, we investigate and compare reasoning in large language models (LLMs) and hu... more In the present study, we investigate and compare reasoning in large language models (LLMs) and humans, using a selection of cognitive psychology tools traditionally dedicated to the study of (bounded) rationality. We presented to human participants and an array of pretrained LLMs new variants of classical cognitive experiments, and cross-compared their performances. Our results showed that most of the included models presented reasoning errors akin to those frequently ascribed to error-prone, heuristic-based human reasoning. Notwithstanding this superficial similarity, an indepth comparison between humans and LLMs indicated important differences with human-like reasoning, with models' limitations disappearing almost entirely in more recent LLMs' releases. Moreover, we show that while it is possible to devise strategies to induce better performance, humans and machines are not equally responsive to the same prompting schemes. We conclude by discussing the epistemological implications and challenges of comparing human and machine behavior for both artificial intelligence and cognitive psychology.
Do we preferentially learn from outcomes that confirm our choices? In recent years, we investigat... more Do we preferentially learn from outcomes that confirm our choices? In recent years, we investigated this question in a series of studies implementing increasingly complex behavioral protocols. The learning rates fitted in experiments featuring partial or complete feedback, as well as free and forced choices, were systematically found to be consistent with a choice-confirmation bias. One of the prominent behavioral consequences of the confirmatory learning rate pattern is choice hysteresis: that is, the tendency of repeating previous choices, despite contradictory evidence. However, choice-confirmatory pattern of learning rates may spuriously arise from not taking into consideration an explicit choice (gradual) perseveration term in the model. In the present study, we reanalyze data from four published papers (nine experiments; 363 subjects; 126,192 trials), originally included in the studies demonstrating or criticizing the choice-confirmation bias in human participants. We fitted two models: one featured valence-specific updates (i.e., different learning rates for confirmatory and disconfirmatory outcomes) and one additionally including gradual perseveration. Our analysis confirms that the inclusion of the gradual perseveration process in the model significantly reduces the estimated choice-confirmation bias. However, in all considered experiments, the choice-confirmation bias remains present at the meta-analytical level, and significantly different from zero in most experiments. Our results demonstrate that the choice-confirmation bias resists the inclusion of a gradual perseveration term, thus proving to be a robust feature of human reinforcement learning. We conclude by pointing to additional computational processes that may play an important role in estimating and interpreting the computational biases under scrutiny. (PsycInfo Database Record (c) 2022 APA, all rights reserved)
Understanding how learning changes during human development has been one of the long-standing obj... more Understanding how learning changes during human development has been one of the long-standing objectives of developmental science. Recently, advances in computational biology have demonstrated that humans display a bias when learning to navigate novel environments through rewards and punishments: they learn more from out
Backgrounds. Value-based decision-making impairment in depression is a complex phenomenon: while ... more Backgrounds. Value-based decision-making impairment in depression is a complex phenomenon: while some studies did find evidence of blunted reward learning and reward-related signals in the brain, others indicate no effect. Here we test whether such reward sensitivity deficits are dependent on the overall value of the decision problem. Methods. We used a two-armed bandit task with two different contexts: one 'rich', one 'poor' where both options were associated with an overall positive, negative expected value, respectively. We tested patients (N = 30) undergoing a major depressive episode and age, gender and socioeconomically matched controls (N = 26). Learning performance followed by a transfer phase, without feedback, were analyzed to distangle between a decision or a value-update process mechanism. Finally, we used computational model simulation and fitting to link behavioral patterns to learning biases. Results. Control subjects showed similar learning performance in the 'rich' and the 'poor' contexts, while patients displayed reduced learning in the 'poor' context. Analysis of the transfer phase showed that the context-dependent impairment in patients generalized, suggesting that the effect of depression has to be traced to the outcome encoding. Computational model-based results showed that patients displayed a higher learning rate for negative compared to positive outcomes (the opposite was true in controls). Conclusions. Our results illustrate that reinforcement learning performances in depression depend on the value of the context. We show that depressive patients have a specific trouble in contexts with an overall negative state value, which in our task is consistent with a negativity bias at the learning rates level.
Humans do not integrate new information objectively: outcomes carrying a positive affective value... more Humans do not integrate new information objectively: outcomes carrying a positive affective value and evidence confirming one’s own prior belief are overweighed. Until recently, theoretical and empirical accounts of the positivity and confirmation biases assumed them to be specific to ‘high-level’ belief updates. We present evidence against this account. Learning rates in reinforcement learning (RL) tasks, estimated across different contexts and species, generally present the same characteristic asymmetry, suggesting that belief and value updating processes share key computational principles and distortions. This bias generates over-optimistic expectations about the probability of making the right choices and, consequently, generates over-optimistic reward expectations. We discuss the normative and neurobiological roots of these RL biases and their position within the greater picture of behavioral decision-making theories.
A wealth of evidence in perceptual and economic decisionmaking research suggests that the subject... more A wealth of evidence in perceptual and economic decisionmaking research suggests that the subjective assessment of one option is influenced by the context. A series of studies provides evidence that the same coding principles apply to situations where decisions are shaped by past outcomes, that is, in reinforcement-learning situations. In bandit tasks, human behavior is explained by models assuming that individuals do not learn the objective value of an outcome, but rather its subjective, context-dependent representation. We argue that, while such outcome context-dependence may be informationally or ecologically optimal, it concomitantly undermines the capacity to generalize value-based knowledge to new contexts-sometimes creating apparent decision paradoxes.
Evidence suggests that economic values are rescaled as a function of the range of the available o... more Evidence suggests that economic values are rescaled as a function of the range of the available options. Although locally adaptive, range adaptation has been shown to lead to suboptimal choices, particularly notable in reinforcement learning (RL) situations when options are extrapolated from their original context to a new one. Range adaptation can be seen as the result of an adaptive coding process aiming at increasing the signal-to-noise ratio. However, this hypothesis leads to a counterintuitive prediction: Decreasing task difficulty should increase range adaptation and, consequently, extrapolation errors. Here, we tested the paradoxical relation between range adaptation and performance in a large sample of participants performing variants of an RL task, where we manipulated task difficulty. Results confirmed that range adaptation induces systematic extrapolation errors and is stronger when decreasing task difficulty. Last, we propose a range-adapting model and show that it is able to parsimoniously capture all the behavioral results.
While there is no doubt that social signals affect human reinforcement learning, there is still n... more While there is no doubt that social signals affect human reinforcement learning, there is still no consensus about how this process is computationally implemented. To address this issue, we compared three psychologically plausible hypotheses about the algorithmic implementation of imitation in reinforcement learning. The first hypothesis, decision biasing (DB), postulates that imitation consists in transiently biasing the learner's action selection without affecting their value function. According to the second hypothesis, model-based imitation (MB), the learner infers the demonstrator's value function through inverse reinforcement learning and uses it to bias action selection. Finally, according to the third hypothesis, value shaping (VS), the demonstrator's actions directly affect the learner's value function. We tested these three hypotheses in 2 experiments (N = 24 and N = 44) featuring a new variant of a social reinforcement learning task. We show through model comparison and model simulation that VS provides the best explanation of learner's behavior. Results replicated in a third independent experiment featuring a larger cohort and a different design (N = 302). In our experiments, we also manipulated the quality of the demonstrators' choices and found that learners were able to adapt their imitation rate, so that only skilled demonstrators were imitated. We proposed and tested an efficient meta-learning process to account for this effect, where imitation is regulated by the agreement between the learner and the demonstrator. In sum, our findings provide new insights and perspectives on the computational mechanisms underlying adaptive imitation in human reinforcement learning.
Language Huntington's disease Basal ganglia Brain mapping a b s t r a c t Though accumulating evi... more Language Huntington's disease Basal ganglia Brain mapping a b s t r a c t Though accumulating evidence indicates that the striatum is recruited during language processing, the specific function of this subcortical structure in language remains to be elucidated. To answer this question, we used Huntington's disease as a model of striatal lesion. We investigated the morphological deficit of 30 early Huntington's disease patients with a novel linguistic task that can be modeled within an explicit theory of linguistic computation. Behavioral results reflected an impairment in HD patients on the linguistic task. Computational model-based analysis compared the behavioral data to simulated data from two distinct lesion models, a selection deficit model and a grammatical deficit model. This analysis revealed that the impairment derives from an increased randomness in the process of selecting between grammatical alternatives, rather than from a disruption of grammatical knowledge per se. Voxel-based morphometry permitted to correlate this impairment to dorsal striatal degeneration. We thus show that the striatum holds a role in the selection of linguistic alternatives, just as in the selection of motor and cognitive programs.
Investigating the bases of inter-individual differences in risk-taking is necessary to refine our... more Investigating the bases of inter-individual differences in risk-taking is necessary to refine our cognitive and neural models of decision-making and to ultimately counter risky behaviors in real-life policy settings. However, recent evidence suggests that behavioral tasks fare poorly compared to standard questionnaires to measure individual differences in risk-taking. Crucially, using model-based measures of risk taking does not seem to improve reliability. Here, we put forward two possible-not mutually exclusive-explanations for these results and suggest future avenues of research to improve the assessment of inter-individual differences in risk-taking by combining repeated online testing and mechanistic computational models.
Money is a fundamental and ubiquitous institution in modern economies. However, the question of i... more Money is a fundamental and ubiquitous institution in modern economies. However, the question of its emergence remains a central one for economists. The monetary search-theoretic approach studies the conditions under which commodity money emerges as a solution to override frictions inherent to interindi-vidual exchanges in a decentralized economy. Although among these conditions, agents' rationality is classically essential and a prerequisite to any theoretical monetary equilibrium, human subjects often fail to adopt optimal strategies in tasks implementing a search-theoretic paradigm when these strategies are speculative, i.e., involve the use of a costly medium of exchange to increase the probability of subsequent and successful trades. In the present work, we hypothesize that implementing such speculative behaviors relies on reinforcement learning instead of lifetime utility calculations , as supposed by classical economic theory. To test this hypothesis, we operationalized the Kiyotaki and Wright paradigm of money emergence in a multistep exchange task and fitted be-havioral data regarding human subjects performing this task with two reinforcement learning models. Each of them implements a distinct cognitive hypothesis regarding the weight of future or counterfactual rewards in current decisions. We found that both models outperformed theoretical predictions about subjects' behaviors regarding the implementation of speculative strategies and that the latter relies on the degree of the opportunity costs consideration in the learning process. Speculating about the mar-ketability advantage of money thus seems to depend on mental simulations of counterfactual events that agents are performing in exchange situations. search-theoretic model | reinforcement learning | speculative behavior | opportunity cost M oney is both a very complex social phenomenon and easy to manipulate in everyday basic transactions. It is an institutional solution to common frictions in an exchange economy, such as the absence of double coincidence of wants between traders (1). It is of widespread use despite its being dominated in terms of rate of return by all other assets (2). However, it can be speculatively used in a fundamental sense: Its economically dominated holding can be justified by the anticipation of future trading opportunities that are not available at the present moment but will necessitate this particular holding. In this study, we concentrate on a paradigm of commodity-money emergence in which one of the goods exchanged in the economy becomes the selected medium of exchange despite its storage being costlier than any other good. This is typical monetary speculation, in contrast to other types of speculation, which consist in expecting an increased price on the market of a good in the future. The price of money does not vary: only the opportunity that it can afford in the future does. This seems to us to be an important feature of speculative economic behavior relative to the otherwise apparently irrational holding of such a good. We study whether individuals endowed with some information about future exchange opportunities will tend to consider a financially dominated good as a medium for exchange. 
Modern behaviorally founded theories of the emergence of money and monetary equilibrium (3, 4) are jointly based on the idea of minimizing a trading search process and on individual choices of accepting, declining, or postponing immediate exchanges at different costs incurred. We focus on an influent paradigm by Kiyotaki and Wright (4) (KW hereafter) in which the individual choice of accepting temporarily costly exchanges due to the anticipation of later better trading opportunities is precisely stylized as a speculative behavior and yields a corresponding monetary equilibrium. The environment of this paradigm consists of N agents specialized in terms of both consumption and production in such a manner that there is initially no double coincidence of wants. Frictions in the exchange process create a necessity for at least some of the agents to trade for goods that they neither produce nor consume, which are then used as media of exchange. The ultimate goal of agents-that is, to consume-may then require multiple steps to be achieved. The most interesting part is that in some configurations, the optimal medium of exchange (i.e., the good that maximizes expected utility because of its relatively Significance In the present study, we applied reinforcement learning models that are not classically used in experimental economics to a multistep exchange task of the emergence of money derived from a classic search-theoretic paradigm for the emergence of money. This method allowed us to highlight the importance of counterfactual feedback processing of opportunity costs in the learning process of speculative use of money and the pre-dictive power of reinforcement learning models for multistep economic tasks. Those results constitute a step toward understanding the learning processes at work in multistep economic decision-making and the cognitive microfoundations of the use of money.
The extent to which subjective awareness influences reward processing, and thereby affects future... more The extent to which subjective awareness influences reward processing, and thereby affects future decisions, is currently largely unknown. In the present report, we investigated this question in a reinforcement learning framework, combining perceptual masking, computational modeling, and electroencephalographic recordings (human male and female participants). Our results indicate that degrading the visibility of the reward decreased, without completely obliterating, the ability of participants to learn from outcomes, but concurrently increased their tendency to repeat previous choices. We dissociated electrophysiological signatures evoked by the reward-based learning processes from those elicited by the reward-independent repetition of previous choices and showed that these neural activities were significantly modulated by reward visibility. Overall, this report sheds new light on the neural computations underlying reward-based learning and decision-making and highlights that awareness is beneficial for the trial-by-trial adjustment of decision-making strategies.
In economics and perceptual decision-making contextual effects are well documented, where decisio... more In economics and perceptual decision-making contextual effects are well documented, where decision weights are adjusted as a function of the distribution of stimuli. Yet, in reinforcement learning literature whether and how contextual information pertaining to decision states is integrated in learning algorithms has received comparably little attention. Here, we investigate reinforcement learning behavior and its computational substrates in a task where we orthogonally manipulate outcome valence and magnitude, resulting in systematic variations in state-values. Model comparison indicates that subjects' behavior is best accounted for by an algorithm which includes both reference point-dependence and range-adaptation-two crucial features of state-dependent valuation. In addition, we find that state-dependent outcome valuation progressively emerges, is favored by increasing outcome information and correlated with explicit understanding of the task structure. Finally, our data clearly show that, while being locally adaptive (for instance in negative valence and small magnitude contexts), state-dependent valuation comes at the cost of seemingly irrational choices, when options are extrapolated out from their original contexts.
Adaptive coding of stimuli is well documented in perception, where it supports efficient encoding... more Adaptive coding of stimuli is well documented in perception, where it supports efficient encoding over a broad range of possible percepts. Recently, a similar neural mechanism has been reported also in value-based decision, where it allows optimal encoding of vast ranges of values in PFC: neuronal response to value depends on the choice context (relative coding), rather than being invariant across contexts (absolute coding). Additionally, value learning is sensitive to the amount of feedback information: providing complete feedback (both obtained and forgone outcomes) instead of partial feedback (only obtained outcome) improves learning. However, it is unclear whether relative coding occurs in all PFC regions and how it is affected by feedback information. We systematically investigated univariate and multivariate feedback encoding in various mPFC regions and compared three modes of neural coding: absolute, partially-adaptive and fully-adaptive. Twenty-eight human participants (both sexes) performed a learning task while undergoing fMRI scanning. On each trial, they chose between two symbols associated with a certain outcome. Then, the decision outcome was revealed. Notably, in one-half of the trials participants received partial feedback, whereas in the other half they got complete feedback. We used univariate and multivariate analysis to explore value encoding in different feedback conditions. We found that both obtained and forgone outcomes were encoded in mPFC, but with opposite sign in its ventral and dorsal subdivisions. Moreover, we showed that increasing feedback information induced a switch from absolute to relative coding. Our results suggest that complete feedback information enhances context-dependent outcome encoding. This study offers a systematic investigation of the effect of the amount of feedback information (partial vs complete) on uni-variate and multivariate outcome value encoding, within multiple regions in mPFC and cingulate cortex that are critical for value-based decisions and behavioral adaptation. Moreover, we provide the first comparison of three possible models of neu-ral coding (i.e., absolute, partially-adaptive, and fully-adaptive coding) of value signal in these regions, by using commensura-ble measures of prediction accuracy. Taken together, our results help build a more comprehensive picture of how the human brain encodes and processes outcome value. In particular, our results suggest that simultaneous presentation of obtained and foregone outcomes promotes relative value representation.
In simple instrumental-learning tasks, humans learn to seek gains and to avoid losses equally wel... more In simple instrumental-learning tasks, humans learn to seek gains and to avoid losses equally well. Yet, two effects of valence are observed. First, decisions in loss-contexts are slower. Second, loss contexts decrease individuals' confidence in their choices. Whether these two effects are two manifestations of a single mechanism or whether they can be partially dissociated is unknown. Across six experiments, we attempted to disrupt the valence-induced motor bias effects by manipulating the mapping between decisions and actions and imposing constraints on response times (RTs). Our goal was to assess the presence of the valence-induced confidence bias in the absence of the RT bias. We observed both motor and confidence biases despite our disruption attempts, establishing that the effects of valence on motor and metacognitive responses are very robust and replicable. Nonetheless, within-and between-individual inferences reveal that the confidence bias resists the disruption of the RT bias. Therefore, although concomitant in most cases, valence-induced motor and confidence biases seem to be partly dissociable. These results highlight new important mechanistic constraints that should be incorporated in learning models to jointly explain choice, reaction times and confidence.
Depending on environmental demands, humans can learn and exploit multiple concurrent sets of stimulus-response associations. Mechanisms underlying the learning of such task-sets remain unknown. Here we investigate the hypothesis that task-set learning relies on unsupervised chunking of stimulus-response associations that occur in temporal proximity. We examine behavioral and neural data from a task-set learning experiment using a network model. We first show that task-set learning can be achieved provided the timescale of chunking is slower than the timescale of stimulus-response learning. Fitting the model to behavioral data on a subject-by-subject basis confirmed this expectation and led to specific predictions linking chunking and task-set retrieval that were borne out by behavioral performance and reaction times. Comparing the model activity with BOLD signal allowed us to identify neural correlates of task-set retrieval in a functional network involving ventral and dorsal prefrontal cortex, with the dorsal system preferentially engaged when retrievals are used to improve performance.
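The core computational claim, that chunking must operate on a slower timescale than stimulus-response learning, can be illustrated with a deliberately simplified sketch. Everything below (the greedy choice rule, the Hebbian co-occurrence matrix, the learning-rate values) is an assumption for illustration and not the network model fitted in the study.

```python
import numpy as np

# Deliberately simplified two-timescale sketch: fast learning of individual
# stimulus-response (S-R) associations, plus slower Hebbian "chunking" that
# binds associations occurring in temporal proximity. All values are assumed.

n_stim, n_resp = 3, 3
alpha_sr = 0.30      # fast timescale: S-R association learning rate (assumed)
alpha_chunk = 0.05   # slow timescale: chunking rate (assumed, much smaller)

Q = np.zeros((n_stim, n_resp))        # S-R association strengths
C = np.zeros((n_stim * n_resp,) * 2)  # chunking (co-occurrence) matrix

rng = np.random.default_rng(0)
prev = None                           # index of the previous rewarded association

def idx(s, r):
    return s * n_resp + r

def trial(s, correct_r):
    """One trial: respond to stimulus s, learn from feedback, chunk with the previous trial."""
    global prev
    r = int(np.argmax(Q[s] + 1e-6 * rng.standard_normal(n_resp)))  # greedy choice
    reward = 1.0 if r == correct_r else 0.0
    Q[s, r] += alpha_sr * (reward - Q[s, r])                       # fast S-R update
    if reward and prev is not None:                                # slow chunking of
        C[prev, idx(s, r)] += alpha_chunk                          # temporally adjacent,
        C[idx(s, r), prev] += alpha_chunk                          # rewarded associations
    prev = idx(s, r) if reward else prev
    return reward

# Present a consistent task-set (stimulus i -> response i) for 200 trials.
for _ in range(200):
    s = int(rng.integers(n_stim))
    trial(s, correct_r=s)

# Associations belonging to the same task-set now co-occur in C, so retrieving one
# of them could be used to reactivate (retrieve) the rest of the set.
```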
Determining whether similar valence-induced biases exist in reinforcement learning and probabilistic reasoning may be crucial to help refine our understanding of adaptive and maladaptive decision-making through the lens of a unified computational approach. Standard reinforcement learning models conceive agents as impartial learners: they learn equally well from positive and negative outcomes alike [1]. However, empirical studies have recently come to challenge this view by demonstrating that human learners, rather than processing information impartially, consistently display a valence-induced bias: when faced with uncertain choice options, they tend to disregard bad news by integrating worse-than-expected outcomes (negative prediction errors) at a lower rate relative to better-than-expected ones (positive prediction errors) [2-4]. This positivity bias would echo the asymmetric processing of self-relevant information in probabilistic reasoning, whereby good news on average receives more weight than bad news [5,6]. A bias for learning preferentially from better-than-expected outcomes would reflect a preference for positive events in general. However, this prediction is at odds with recent findings. In a two-armed bandit task featuring complete feedback information, we previously found that participants would learn preferentially from better-than-expected obtained outcomes while preferentially learning from worse-than-expected forgone outcomes (that is, from the outcome associated with the option they had not chosen) [7]. This learning asymmetry suggests that what has previously been characterized as a positivity bias may, in fact, be the upshot of a more general, and perhaps ubiquitous, choice-confirmation bias, whereby human agents preferentially integrate information that confirms their previous decision [8]. Building on these previous findings, we reasoned that if human reinforcement learning is indeed biased in a choice-confirmatory manner, learning from action-outcome couplings that were not voluntarily chosen by the subject (forced choices) should present no bias. To test this hypothesis, we conducted three experiments involving instrumental learning and computational model-based analyses. Participants were administered new variants of a probabilistic learning task in which they could either freely choose between two options or were 'forced' to implement the choice made by a computer. In the first experiment, participants were only shown the obtained outcome corresponding to their choice (factual learning). In the second experiment, participants were shown both the obtained and the forgone outcome (counterfactual learning). Finally, to address a concern raised during the review process, a third experiment was included in which both free- and forced-choice trials featured a condition with a random reward schedule (50/50). The rationale for implementing this reward schedule was to test whether or not the confirmation bias was due to potential sampling differences between types of trials. Indeed, in the free-choice condition, the most rewarding symbol should be increasingly selected as the subject learns the structure of the task. A random reward schedule eliminates the possibility of such unbalanced sampling between free- and forced-choice conditions. We had two key predictions.
With regard to factual learning, participants should learn better from positive prediction errors, but they should only do so when free to choose (free-choice trials), while showing no effect when forced to match a computer's choice (forced-choice trials). With regard to counterfactual learning from forgone outcomes, we expected the opposite pattern: in free-choice trials, negative prediction errors should be more likely to be taken into account than positive ones, whereas no such asymmetry should be present in forced-choice trials.

The valence of new information influences learning rates in humans: good news tends to receive more weight than bad news. We investigated this learning bias in four experiments, by systematically manipulating the source of the required action (free versus forced choices), outcome contingencies (low versus high reward) and motor requirements (go versus no-go choices). Analysis of model-estimated learning rates showed that the confirmation bias in learning rates was specific to free choices, but was independent of outcome contingencies. The bias was also unaffected by the motor requirements, thus suggesting that it operates in the representational space of decisions rather than of motoric actions. Finally, model simulations revealed that learning rates estimated from the choice-confirmation model had the effect of maximizing performance across low- and high-reward environments. We therefore suggest that choice-confirmation bias may be adaptive for efficient learning of action-outcome contingencies, above and beyond fostering person-level dispositions such as self-esteem.
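Putting these predictions together, a compact, hypothetical implementation of the choice-confirmation update is sketched below: learning rates differ for outcomes that confirm versus disconfirm the agent's own choice (with the asymmetry reversed for forgone outcomes), and the bias is assumed to vanish when the choice is forced. The function name and learning-rate values are illustrative assumptions, not the authors' fitted model.

```python
# Hypothetical choice-confirmation update with complete feedback. Learning-rate
# values and the function name are illustrative assumptions, not fitted parameters.

def confirmation_update(q_chosen, q_unchosen, r_obtained, r_forgone, free_choice,
                        alpha_conf=0.4, alpha_disc=0.1, alpha_neutral=0.25):
    pe_obtained = r_obtained - q_chosen    # factual prediction error
    pe_forgone = r_forgone - q_unchosen    # counterfactual prediction error
    if free_choice:
        # Confirmatory outcomes (good news about my choice, bad news about the
        # forgone option) are integrated at a higher rate than disconfirmatory ones.
        a_obtained = alpha_conf if pe_obtained > 0 else alpha_disc
        a_forgone = alpha_conf if pe_forgone < 0 else alpha_disc
    else:
        # Forced choices: no decision to confirm, so updating is assumed unbiased.
        a_obtained = a_forgone = alpha_neutral
    return (q_chosen + a_obtained * pe_obtained,
            q_unchosen + a_forgone * pe_forgone)

# A confirmatory pair of outcomes (obtained reward, forgone non-reward) shifts both
# estimates strongly in favor of the chosen option only when the choice was free.
print(confirmation_update(0.5, 0.5, 1.0, 0.0, free_choice=True))   # (0.7, 0.3)
print(confirmation_update(0.5, 0.5, 1.0, 0.0, free_choice=False))  # (0.625, 0.375)
```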
Approaching rewards and avoiding punishments are core principles that govern the adaptation of behavior to the environment. The machine learning literature has proposed formal algorithms to account for how agents adapt their decisions to optimize outcomes. In principle, these reinforcement learning models could be applied equally to positive and negative outcomes, i.e., rewards and punishments. Yet many neuroscience studies have suggested that reward and punishment learning might be underpinned by distinct brain systems. Reward learning has been shown to recruit midbrain dopaminergic nuclei and ventral prefrontostriatal circuits. The picture is less clear regarding the existence and anatomy of an opponent system: several hypotheses have been formulated for the neural implementation of punishment learning. In this chapter, we review the evidence for and against each hypothesis, focusing on human studies that compare the effects of neural perturbation, following drug administration and/or pathological conditions, on reward and punishment learning. "Good and evil, reward and punishment, are the only motives to a rational creature: these are the spur and reins whereby all mankind are set on work, and guided." These famous words by John Locke suggest that rewards and punishments are not on a continuum from positive to negative: they pertain to distinct categories of events that we can imagine or experience. Indeed, rewards and punishments trigger different kinds of subjective feelings (such as pleasure versus pain, or desire versus dread) and elicit different types of behaviors (approach versus avoidance, or invigoration versus inhibition). These considerations might suggest the idea that rewards and punishments are processed by different parts of the brain. In this chapter we examine this idea in the context of reinforcement learning, a computational process that could in principle apply equally to rewards and punishments. We start by summarizing the computational principles underlying reinforcement learning (Box 23.1 and Fig. 23.1) and by describing typical tasks that implement a comparison between reward and punishment learning (Box 23.2 and Fig. 23.2). Then we present the current hypotheses about the possible implementation of reward and punishment learning systems in the brain (Fig. 23.3). Last, we review the evidence for and against each hypothesis.
Box 23.1: The first reinforcement learning (RL) models come from the behaviorist tradition, in the form of mathematical laws describing learning curves [82] or formal descriptions of associative conditioning [2]. Subsequently, in the 1980s, computational investigation of RL received a significant boost when it grabbed the attention of machine learning scholars, who were aiming at developing algorithms for goal-oriented artificial agents [1].
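To make the chapter's computational starting point concrete, here is a minimal, illustrative sketch (not taken from the chapter; the learning rate and outcome coding are assumptions) showing that a single Rescorla-Wagner-style update rule can, in principle, handle rewards and punishments symmetrically, which is precisely why the question of separate neural systems is an empirical rather than an algorithmic one.

```python
# Minimal, valence-symmetric value update: the same rule is applied to a rewarded
# option (+1) and a punished option (-1). Parameter values are assumed.

def update(q, outcome, alpha=0.3):
    """Rescorla-Wagner-style update of a value estimate towards the observed outcome."""
    return q + alpha * (outcome - q)

q_reward, q_punish = 0.0, 0.0
for _ in range(20):
    q_reward = update(q_reward, +1.0)   # reward learning
    q_punish = update(q_punish, -1.0)   # punishment (avoidance) learning

# Both estimates converge symmetrically (towards +1 and -1): algorithmically the two
# problems are the same, even if the brain may implement them in partly distinct systems.
print(round(q_reward, 3), round(q_punish, 3))
```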