Research Articles

Foundations of human reasoning in the prefrontal cortex

Maël Donoso,1,2,3 Anne G. E. Collins,2,4 Etienne Koechlin1,2,3*

1 Institut National de la Santé et de la Recherche Médicale, Paris, France. 2 Ecole Normale Supérieure, Paris, France. 3 Université Pierre et Marie Curie, Paris, France. 4 Brown University, Providence, RI 02912, USA.

*Corresponding author. E-mail: etienne.koechlin@upmc.fr

http://www.sciencemag.org/content/early/recent / 29 May 2014 / 10.1126/science.1252254

The prefrontal cortex (PFC) subserves reasoning in the service of adaptive behavior. Little is known however about the architecture of reasoning processes in the PFC. Using computational modeling and neuroimaging, we show here that the human PFC comprises two concurrent inferential tracks: one from ventromedial to dorsomedial PFC regions that makes probabilistic inferences about the reliability of the ongoing behavioral strategy and arbitrates between adjusting this strategy vs. exploring new ones from long-term memory; another from polar to lateral PFC regions that makes probabilistic inferences about the reliability of two/three alternative strategies and arbitrates between exploring new strategies vs. exploiting these alternative ones. The two tracks interact and along with the striatum, realize hypothesis-testing for accepting vs. rejecting newly created strategies.

Human reasoning subserves adaptive behavior and has evolved facing the uncertainty of everyday environments. In such situations, probabilistic inferential processes (i.e., Bayesian inferences) make optimal use of available information for making decisions. Human reasoning involves Bayesian inferences accounting for human responses that often deviate from formal logic (1). Bayesian inferences also operate in the prefrontal cortex and guide behavioral choices (2, 3). Everyday environments, however, are changing and open-ended, so that the range of uncertain situations and associated behavioral strategies (i.e., internal mappings linking stimuli, actions and expected outcomes) becomes potentially infinite. In such environments, probabilistic inferences involve Dirichlet process mixtures (4–7) and rapidly yield to intractable computations. This computational complexity problem constitutes a fundamental constraint bearing upon the evolution of higher cognitive functions and raises the issue of the actual nature of inferential processes implemented in the prefrontal cortex.

A Model of Reasoning Processes in the Human Prefrontal Cortex

To address this issue, we proposed a model (8) that describes human reasoning guiding behavior as a computationally tractable, online algorithm approximating Dirichlet process mixtures (9). The algorithm combines forward Bayesian inferences operating over a few concurrent behavioral strategies stored in long-term memory and hypothesis testing for possibly updating this inferential buffer with new strategies formed from long-term memory. The algorithm notably serves to arbitrate between: (i) staying with the ongoing behavioral strategy and possibly learning external contingencies; (ii) switching to other learned strategies; and (iii) forming new behavioral strategies.

For integrating online Bayesian inferences and hypothesis testing, the algorithm's key feature is to infer the absolute reliability of every monitored strategy: namely, the posterior probability that the current situation matches the situation the strategy has learned, given both action outcomes (and possibly contextual cues) and the possibility that no match occurs with any monitored strategies. To estimate these probabilities, the model assumes that in this latter case, action outcomes expected from the monitored strategies are equiprobable (9). Thus, every monitored strategy may appear as being either reliable (i.e. more likely matching than not matching the current situation) or unreliable (the converse). When a strategy is reliable, the others are necessarily unreliable so that the algorithm is in an exploitation state (Fig. 1): the reliable strategy is the actor, namely the unique strategy selecting and learning the actions that maximize rewards (typically through reinforcement learning), while the other monitored strategies are treated as counterfactual. When all monitored strategies become unreliable, the algorithm then switches into an exploration state corresponding to hypothesis testing: a new strategy is formed as a weighted mixture of strategies stored in long-term memory, then probed and monitored as actor (9). A priori unreliable, this probe actor learns so that the algorithm may subsequently return to the exploitation state in two ways. Either one counterfactual strategy becomes reliable, while the probe actor remains unreliable: the former is then retrieved as actor and the latter is rejected (disbanded). Or the probe actor becomes reliable, while counterfactual strategies remain unreliable. The probe actor is then confirmed: remaining the actor, the new strategy is simply consolidated in long-term memory, thereby expanding the repertoire of stored strategies. In case the inferential buffer has further reached its capacity limit, the counterfactual strategy used the least recently as actor is then discarded from the buffer (but remains stored in long-term memory).

Consistent with the capacity-limit of human working memory (10), human decisions are best predicted, when the inferential buffer is limited to two/three concurrent counterfactual strategies (8). We then hypothesized that the human PFC implements this algorithm. We expected anterior PFC regions to form the inferential buffer (3, 11–13), and more posterior PFC regions in association with basal ganglia to drive actor learning, selection and creation based on hypothesis-testing (14–18). The model predicts that anterior PFC regions concurrently infer the absolute reliability of actor and counterfactual strategies the algorithm builds online. More posterior PFC regions then detect when in the inferential buffer, actor strategies become unreliable for creating probe actors as well as when counterfactual strategies become reliable for retrieving them as actor (and possibly rejecting probe actors). In basal ganglia, the ventral striatum subserves reinforcement learning (16, 19, 20) and is predicted to detect when in the inferential buffer, probe actors become reliable for confirming them in long-term memory (21).

Behavioral Paradigm

To test these predictions, we used functional magnetic resonance imaging and scanned 40 healthy participants, while they were responding to successively presented digits and searching for 3-digit combinations by trial and error (fig. S1) (9). Feedbacks were noisy and combinations changed episodically. Unbeknownst to them, participants performed two distinct sessions. In the open session, every episode corresponded to new combinations, whereas in the recurrent session, only three combinations reoccurred unpredictably across episodes. The protocol thus induced participants to reason from feedbacks whether they had to perseverate with the same combination and possibly adjust it, reuse previously learned ones, or learn/search for new combinations.

In every trial, participants’ responses were either correct, perseverative (incorrect but correct in the preceding episode) or exploratory (neither correct nor perseverative). Overall, participants performed much below the statistical optimum (8). In both conditions, correct response rates increased from ∼5% at episode onsets to a plateau at ∼85% about 25 trials later (chance level: 25%) (Fig. 2, left).
Exploratory response rates increased from ∼10% at episode onsets, peaked at ∼40% five trials later and then returned to ∼10% (chance level: 50%). Correct responses increased and exploratory responses vanished faster in the recurrent than open episodes (both Fs > 21.8, Ps < 0.0001). In the first trials of recurrent episodes, furthermore, a positive feedback caused the production of correct responses in the next trial, even when the two successively presented digits differed: the statistical dependence between two successive correct responses increased in the first trials of recurrent compared to open episodes (Trials 1 and 2: Ts > 2.25, Ps < 0.03) (Fig. 2, bottom), while remaining similar in both conditions on the following trials. In these first recurrent trials, accordingly, participants used feedbacks to retrieve previously learned combinations rather than recollecting each digit-response association separately. Participants consequently built and stored multiple combinations and monitored feedbacks for either retrieving these combinations or learning new ones. Combinations thus defined behavioral strategies associating digits, responses and expected feedbacks. We fit the model free parameters (buffer-capacity, prior reliability and recollection entropy of probe actors, reinforcement learning parameters) to each participant’s series of responses (table S1) (9). In both recurrent and open episodes, the fitted model predicted participants’ responses and their statistical dependencies across successive trials (Fig. 2, right). The model fit significantly better than alternative models, independently of model complexity and fitting criteria (fig. S2) (9). Moreover, fitted parameters were independent of which session was fitted (T < 1) and consequently, unrelated to the number of combinations used in recurrent sessions. 
The best-fitting capacity—whether fixed or averaged across subjects—included 2 counterfactual strategies (mean = 2.6, sem = 0.24, median = 2) (table S1). The model critically reveals that the gradual variations of responses reported above are actually artifacts from aligning performances from episode onsets and averaging across episodes (Fig. 3). Following most episode changes (93.9/94.2% of recurrent/open episodes, resp.), indeed, the algorithm switched from exploitation to exploration and created probe actors from long-term memory at variable time points across episodes (on average 3.3 (sd = 0.9) and 4.2 (sd = 1.3) trials after recurrent and open episode onsets, resp.). We refer to these algorithmic transitions as switch-in events. Realigning model and participants’ performances on these switch-in events rather than episode onsets (Fig. 3, left) shows that in exploitation trials preceding switch-in events, both model and participants’ responses were virtually unaffected by episode changes and remained mostly perseverative (∼85-90%), while residual responses remained randomly distributed across exploratory and correct responses (∼8% and ∼4% of residual responses, respectively). In switch-in trials, by contrast, perseverative responses abruptly dropped off (∼40%) and exploratory responses abruptly increased to a plateau (∼35-40%). In exploration trials following switch-in events, both model and participants’ exploratory responses remain on the plateau, while perseverative responses slowly decreased (correct responses consequently increased slowly). In 43% of recurrent episodes, the algorithm terminated these exploration periods by retrieving counterfactual strategies and rejecting probe actors (on average 10.1 (sd = 3.2) after episode onsets). 
In the remaining recurrent episodes (57%) and most open episodes (84%), the algorithm terminated exploration by confirming probe actors in long-term memory (on average 6.7 (sd = 3.3) and 8.1 (sd = 3.9) trials after recurrent and open episode onsets, resp.). We refer to these algorithmic transitions as rejection and confirmation events, respectively. Realigning again model and participants’ performances on these rejection and confirmation events reveals that (Fig. 3, right): when rejection events occurred, both model and participants’ correct responses abruptly increased and exploratory responses abruptly dropped off; when confirmation events occurred, by contrast, correct and exploratory responses exhibited no abrupt changes and as expected, gradually increased and decreased, respectively (more results in supplementary online text).

Brain Activations Associated with Reasoning Computations

We then investigated whether fMRI activations confirm the implementation of the proposed algorithm in the prefrontal cortex. To identify activations associated with inferring strategies’ absolute reliability, we considered three reliability variables derived from the best-fitting model: actor, first- and second-alternative reliability. We entered these variables orthogonalized in that order in a unique regression analysis, which also included the algorithmic events (switch-in, rejection, confirmation) as regressors, along with those modeling exploration and exploitation trials (9). The regression factored out possible confounding variables including reward expectations, outcome predictions and feedback values. Activations were identified using significance thresholds set to P = 0.05 (FWE corrected for multiple comparisons over the frontal lobes) and post-hoc analyses removed selection biases (22).

Strategies’ reliability correlated with anterior PFC activations.
Actor reliability correlated with ventromedial PFC (vmPFC) and perigenual anterior cingulate (pgACC) activations, while right frontopolar (FPC) activations correlated concurrently with both first- and second-alternative reliability (Fig. 4). No other regions exhibited such correlations (p > 0.01, uncorrected). vmPFC/pgACC activations increasing with actor reliability further decreased with first- and more strongly with second-alternative reliability, whereas right FPC activations decreased with actor reliability while increasing with first- and more strongly with second-alternative reliability (Fig. 4). The symmetrical, left FPC region marginally exhibited the same activation pattern as the right FPC (actor, first- and second-alternative reliability: all Ts > 1.99, Ps < 0.053). Accordingly, the less strategies were eligible as actor, the more their reliability elicited FPC activations to the detriment of vmPFC/pgACC activations. vmPFC/pgACC and left FPC activations were also associated with feedback values (Ts > 2.43, Ps < 0.0195), from which strategies’ reliability is inferred.

Using the same regression analysis, we next examined activations in switch-in, rejection and confirmation events associated with hypothesis testing. These algorithmic events elicited more posterior PFC activations. Medially, the dorsal ACC (dACC) responded selectively to switch-in events (Fig. 5A). Switch-in events elicited larger dACC responses than exploitation and exploration trials (both Ts > 3.59, Ps < 0.001) and than rejection and confirmation events (both Ts = 2.02, P = 0.05). These latter events elicited no significant dACC responses compared to exploitation and exploration trials (Ts < 2.02, Ps > 0.05). Confirmation events elicited only marginal dACC responses (T = 2.32, P = 0.03). Laterally, the left PFC (BA 45, mid-LPC) responded selectively to rejection events (Fig. 5B).
Rejection events elicited larger mid-LPC activations than exploitation and exploration trials (both Ts > 4.53, Ps < 0.00006) and than switch-in and confirmation events (joint effect: T = 2.38, P = 0.022). These latter events elicited no significant mid-LPC responses (both Ts < 1.69, Ps > 0.10). Neither the dACC nor the mid-LPC exhibited differential responses between exploitation and exploration trials (Ts < 1) (Fig. 5, A and B), or responses in the trials immediately preceding and following switch-in and rejection events (Fig. 6). Thus, dACC and mid-LPC responses to switch-in and rejection events, respectively, reflected the algorithmic transitions rather than the differential production of perseverative, exploratory vs. correct responses and associated cognitive states around these events. Furthermore, as both switch-in and rejection events involve actor switching based on the same reliability threshold (=1/2), these differential activations could not simply reflect choice uncertainty and general inhibition/selection mechanisms across monitored strategies. Instead, these results indicate that the dACC detects when actors monitored in the pgACC/vmPFC become unreliable for triggering the creation of probe actors, while the mid-LPC detects when counterfactual strategies monitored in the FPC become reliable for retrieving them as actor.

Only the ventral striatum responded selectively to confirmation events (Fig. 5C). Confirmation events elicited larger ventral-striatal activations than exploitation and exploration trials (both Ts > 3.59, Ps < 0.0009) and than switch-in and rejection events (both Ts > 2.99, Ps < 0.005).
There were neither significant ventral-striatal responses to switch-in and rejection events compared to exploitation and exploration trials (all Ts < 1.99, Ps > 0.06), nor differential ventral-striatal responses between exploitation and exploration trials (T = 1.11, P = 0.27), nor significant ventral-striatal responses in the trials immediately preceding and following confirmation events (Fig. 6). The region concurrently responded to reward prediction errors: ventral striatal activations correlated both positively with feedback rewarding values (T = 5.04, P = 0.00002) and negatively with reward expectations (T = 4.25, P = 0.00013). Thus, beyond its involvement in actor reinforcement learning over trials (16), the ventral striatum exhibited additional responses in confirmation events. Because the vmPFC/pgACC projects to the ventral striatum (23) and encoded actor reliability, the evidence indicates that the ventral striatum detects when newly created strategies driving behavior become reliable, presumably for confirming their storage in long-term memory.

The dorsal striatum responded selectively to switch-in events (fig. S3), while bilateral posterior PFC (BA 44, post-LPC) and left premotor regions responded to both switch-in and confirmation events (fig. S4). These activations accord with the involvement of posterior frontal-striatal circuits in forming and storing action sets (18): dorsal- and ventral-striatal responses correlated with premotor/post-LPC responses in switch-in and confirmation events, respectively, when the algorithm created and confirmed probe actors in long-term memory (fig. S5). We found no other frontal and basal responses (p > 0.05, uncorrected) except bilateral responses to switch-in events in FPC regions reported above (fig. S3), likely reflecting that concomitant to probe actor creation, the former actor registers as an additional counterfactual strategy in the inferential buffer (more results in supplementary online text).
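As a rough illustration of the absolute reliability signals used as regressors in this section, the sketch below performs one Bayesian update over a few monitored strategies plus a "no match" hypothesis. It is a simplified reading of the model description, not the authors' implementation (see (9) for the actual model). The function name, the prior values, the likelihood values and the `n_outcomes` parameter are all hypothetical. The only ingredients taken from the text are that outcomes are treated as equiprobable when no monitored strategy matches, that reliabilities sum to at most 1, and that a strategy counts as reliable above 1/2.

```python
def update_reliabilities(priors, likelihoods, n_outcomes):
    """One Bayesian update of absolute reliabilities (illustrative only).

    priors: prior reliability of each monitored strategy (sum <= 1);
        the remaining mass is the prior that no monitored strategy matches.
    likelihoods: probability each strategy assigned to the observed outcome.
    n_outcomes: number of possible outcomes (hypothetical parameter); under
        the "no match" hypothesis every outcome is assumed equiprobable.
    """
    prior_none = 1.0 - sum(priors)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    joint_none = prior_none * (1.0 / n_outcomes)
    evidence = sum(joint) + joint_none
    # Posterior for each strategy; the "no match" posterior is the remainder.
    return [j / evidence for j in joint]


# Hypothetical numbers: the actor (index 0) predicted the outcome poorly,
# while the first counterfactual strategy (index 1) predicted it well.
lambdas = update_reliabilities([0.6, 0.2, 0.1], [0.1, 0.8, 0.1], n_outcomes=2)
reliable = [lam > 0.5 for lam in lambdas]
```

Because the "no match" hypothesis keeps some posterior mass, the reliabilities of the monitored strategies sum to less than 1, so at most one of them can exceed 1/2 at any time.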
Prefrontal Foundations of Human Reasoning The predicted algorithmic transitions associated with hypothesis-testing and accounting for participants’ behavior occurred within the frontal lobes in the expected PFC and striatal regions. Moreover, the anterior PFC encoded the predicted absolute reliability signals associated with the concurrent behavioral strategies the algorithm creates, learns, tests and retrieves for driving action. These results support the hypothesis that the proposed algorithm describes reasoning PFC processes guiding adaptive behavior (supplementary online text). Accordingly, the frontal lobes implement two concurrent inferential tracks. First, a medial track comprising the vmPFC/pgACC, dACC and ventral striatum makes inferences about the actor strategy that through reinforcement learning, selects and learns the actions maximizing reward. While the vmPFC/pgACC infers the actor absolute reliability, the dACC detects when it becomes unreliable for triggering exploration, i.e. the formation of a new strategy from long-term memory to serve as actor. The ventral striatum then detects when this new actor strategy becomes reliable, thereby terminating exploration and confirming it in long-term memory. Second, a lateral track comprising the FPC and mid-LPC makes inferences about two/three alternative strategies stored in long-term memory. While the FPC concurrently infers the absolute reliability of these counterfactual strategies from action outcomes, the mid-LPC detects when one becomes reliable for retrieving it as actor. This medial-lateral segregation stems from the model core notion of absolute reliability, which yields to distinguish between switching away from ongoing behavior (the actor becomes unreliable) vs. switching to another behavioral strategy stored in long-term memory (one counterfactual strategy becomes reliable). 
In this protocol, the two events never coincided, which would have required alternating between only two recurrent situations associated with two distinct strategies (the actor unreliability then implies the reliability of the alternative strategy) (24). The dACC thus triggers switching away from ongoing behavior with the formation of new behavioral strategies, whereas the mid-LPC enables to switch to counterfactual strategies. The model may thus explain dACC activations observed in detecting unexpected action outcomes (25), switching to exploratory behaviors (26) and starting new behavioral tasks (27), and LPC activations in retrieving task-sets (15, 28). Consistent with the model prediction, moreover, the dACC and mid-LPC coactivate when participants switch back and forth between only two alternative behaviors (11). The model further indicates that the coupling between the medial and lateral track realizes hypothesis-testing bearing upon new behavioral strategies created from long-term memory. Serving as probe actor initially set as being unreliable, newly created strategies are disbanded when the mid-LPC detects one counterfactual strategy has become reliable for retrieving it as actor. However, the ventral striatum adjusts probe actors to external contingencies through reinforcement learning (16, 19, 20) and detects when probe actors eventually become reliable. In that event, the ventral striatum confirms probe actors in long-term memory as additional, subsequently recoverable strategies. The interplay between the dACC, mid-LPC and ventral striatum thus controls switches in and out of exploration periods corresponding to hypothesis testing on newly created strategies. Accordingly, every decision to create new strategies may be subsequently revised according to new information, which is critical in optimal adaptive processes operating in open-ended environments for dealing with the intrinsic non-parametric nature of strategy creation (4). 
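The switch-in, rejection and confirmation transitions discussed above can be summarized as a small decision rule over the reliabilities of the monitored strategies. The sketch below is only an illustration under stated assumptions: the function name and the returned labels are hypothetical, the 1/2 threshold follows the reliability criterion used in the text, and probe creation, learning and buffer management are left out.

```python
THRESHOLD = 0.5  # a strategy is reliable when its reliability exceeds 1/2


def next_transition(exploring, actor_lambda, counterfactual_lambdas):
    """Return the algorithmic event implied by the current reliabilities.

    exploring: True if the current actor is a probe actor (exploration).
    actor_lambda: reliability of the current (probe) actor.
    counterfactual_lambdas: reliabilities of the counterfactual strategies.
    """
    best_cf = max(counterfactual_lambdas, default=0.0)
    if not exploring:
        if actor_lambda > THRESHOLD:
            return "stay"        # exploitation continues
        if best_cf > THRESHOLD:
            # Assumption: a direct switch, which the text says never
            # occurred in this protocol.
            return "retrieve"
        return "switch-in"       # all unreliable: create a probe actor
    if best_cf > THRESHOLD:
        return "rejection"       # probe disbanded, counterfactual retrieved
    if actor_lambda > THRESHOLD:
        return "confirmation"    # probe consolidated in long-term memory
    return "explore"             # keep testing the probe actor
```

For example, `next_transition(True, 0.2, [0.6, 0.1])` yields a rejection event, whereas `next_transition(True, 0.7, [0.1])` yields a confirmation event.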
Hypothesis testing derives from inferences about the absolute reliability of actor and two/three counterfactual strategies, which involved the vmPFC/pgACC and FPC, respectively. The dissociation supports the distinction between the notion of actor and counterfactual strategy and accords with the vmPFC/pgACC and FPC involvement in monitoring ongoing and unchosen courses of action, respectively (3, 11, 12, 29, 30). Strategy absolute reliability measures the extent to which the strategy is applicable to the current situation, i.e. the extent to which current external contingencies and those learned by the strategy result from the same latent cause. The vmPFC/pgACC thus infers the extent to which the latent cause determining current action outcomes remains unchanged. The FPC infers the extent to which the latter result from two/three previously identified latent causes. Latent causes are abstract constructs resulting from hypothesis-testing implemented through the interplay between the dACC, mid-LPC and ventral striatum. Latent causes organize long-term memory as a repertoire of behavioral strategies treated as separable entities. By detecting the reliability/unreliability of monitored strategies, the dACC, mid-LPC and ventral striatum then appear to implement true/false exclusive judgments about possible causes of observed contingencies for selecting appropriate behavioral strategies. The model thus describes how the prefrontal cortex forms a unified inferential system subserving reasoning in the service of adaptive behavior. Among the prefrontal regions, the FPC is likely specific to humans (31, 32), suggesting that the ability to jointly infer multiple possible causes of observed contingencies and consequently, to test new causal hypotheses emerging from long-term memory is unique to humans.

References and Notes

1. M. Oaksford, N. Chater, Précis of Bayesian rationality: The probabilistic approach to human reasoning. Behav. Brain Sci. 32, 69–84, discussion 85–120 (2009). doi:10.1017/S0140525X09000284
2. T. E. Behrens, M. W. Woolrich, M. E. Walton, M. F. Rushworth, Learning the value of information in an uncertain world. Nat. Neurosci. 10, 1214–1221 (2007). doi:10.1038/nn1954
3. E. D. Boorman, T. E. Behrens, M. W. Woolrich, M. F. Rushworth, How green is the grass on the other side? Frontopolar cortex and the evidence in favor of alternative courses of action. Neuron 62, 733–743 (2009). doi:10.1016/j.neuron.2009.05.014
4. Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101, 1566–1581 (2006). doi:10.1198/016214506000000302
5. F. Doshi-Velez, The infinite Partially Observable Markov Decision Process. Adv. Neural Inf. Process. Syst. 21, 477–485 (2009).
6. N. D. Daw, A. Courville, The pigeon as particle filter. Adv. Neural Inf. Process. Syst. 20, 369–376 (2007).
7. S. J. Gershman, D. M. Blei, Y. Niv, Context, learning, and extinction. Psychol. Rev. 117, 197 (2010). doi:10.1037/a0017808
8. A. Collins, E. Koechlin, Reasoning, learning, and creativity: Frontal lobe function and human decision-making. PLOS Biol. 10, e1001293 (2012). doi:10.1371/journal.pbio.1001293
9. Materials and methods are available online as supplementary material on Science Online.
10. N. Cowan, in Human Learning and Memory, C. Izawa, N. Ohta, Eds. (Erlbaum, Mahwah, NJ, 2005), pp. 155–175.
11. A. N. Hampton, P. Bossaerts, J. P. O’Doherty, The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. J. Neurosci. 26, 8360–8367 (2006). doi:10.1523/JNEUROSCI.1010-06.2006
12. E. Koechlin, A. Hyafil, Anterior prefrontal function and the limits of human decision-making. Science 318, 594–598 (2007). doi:10.1126/science.1142995
13. E. D. Boorman, T. E. Behrens, M. F. Rushworth, Counterfactual choice and learning in a neural network centered on human lateral frontopolar cortex. PLOS Biol. 9, e1001093 (2011). doi:10.1371/journal.pbio.1001093
14. E. Koechlin, C. Ody, F. Kouneiher, The architecture of cognitive control in the human prefrontal cortex. Science 302, 1181–1185 (2003). doi:10.1126/science.1088545
15. K. Sakai, R. E. Passingham, Prefrontal interactions reflect future task operations. Nat. Neurosci. 6, 75–81 (2003). doi:10.1038/nn987
16. J. O’Doherty, P. Dayan, J. Schultz, R. Deichmann, K. Friston, R. J. Dolan, Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454 (2004). doi:10.1126/science.1094285
17. E. Koechlin, C. Summerfield, An information theoretical approach to prefrontal executive function. Trends Cogn. Sci. 11, 229–235 (2007). doi:10.1016/j.tics.2007.04.005
18. D. Badre, A. S. Kayser, M. D’Esposito, Frontal cortex and the discovery of abstract action rules. Neuron 66, 315–326 (2010). doi:10.1016/j.neuron.2010.03.025
19. W. Schultz, P. Dayan, P. R. Montague, A neural substrate of prediction and reward. Science 275, 1593–1599 (1997). doi:10.1126/science.275.5306.1593
20. K. Doya, Reinforcement learning: Computational theory and biological mechanisms. HFSP J. 1, 30–40 (2007). doi:10.2976/1.2732246
21. J. J. Ribas-Fernandes, A. Solway, C. Diuk, J. T. McGuire, A. G. Barto, Y. Niv, M. M. Botvinick, A neural signature of hierarchical reinforcement learning. Neuron 71, 370–379 (2011). doi:10.1016/j.neuron.2011.05.042
22. M. Esterman, B. J. Tamber-Rosenau, Y. C. Chiu, S. Yantis, Avoiding nonindependence in fMRI data analysis: Leave one subject out. Neuroimage 50, 572–576 (2010). doi:10.1016/j.neuroimage.2009.10.092
23. G. E. Alexander, M. R. DeLong, P. L. Strick, Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci. 9, 357–381 (1986). doi:10.1146/annurev.ne.09.030186.002041
24. S. Charron, E. Koechlin, Divided representation of concurrent goals in the human frontal lobes. Science 328, 360–363 (2010). doi:10.1126/science.1183614
25. W. H. Alexander, J. W. Brown, Medial prefrontal cortex as an action-outcome predictor. Nat. Neurosci. 14, 1338–1344 (2011). doi:10.1038/nn.2921
26. N. Kolling, T. E. Behrens, R. B. Mars, M. F. Rushworth, Neural mechanisms of foraging. Science 336, 95–98 (2012). doi:10.1126/science.1216930
27. N. U. Dosenbach, K. M. Visscher, E. D. Palmer, F. M. Miezin, K. K. Wenger, H. C. Kang, E. D. Burgund, A. L. Grimes, B. L. Schlaggar, S. E. Petersen, A core system for the implementation of task sets. Neuron 50, 799–812 (2006). doi:10.1016/j.neuron.2006.04.031
28. K. Sakai, Task set and prefrontal cortex. Annu. Rev. Neurosci. 31, 219–245 (2008). doi:10.1146/annurev.neuro.31.060407.125642
29. B. De Martino, S. M. Fleming, N. Garrett, R. J. Dolan, Confidence in value-based choice. Nat. Neurosci. 16, 105–110 (2013). doi:10.1038/nn.3279
30. E. Koechlin, G. Basso, P. Pietrini, S. Panzer, J. Grafman, The role of the anterior prefrontal cortex in human cognition. Nature 399, 148–151 (1999). doi:10.1038/20178
31. K. Teffer, K. Semendeferi, Human prefrontal cortex: Evolution, development, and pathology. Prog. Brain Res. 195, 191–218 (2012). doi:10.1016/B978-0-444-53860-4.00009-X
32. F. X. Neubert, R. B. Mars, A. G. Thomas, J. Sallet, M. F. Rushworth, Comparison of human ventral frontal cortex areas for cognitive control and language with areas in monkey frontal cortex. Neuron 81, 700–713 (2014). doi:10.1016/j.neuron.2013.11.012
33. E. T. Jaynes, Phys. Rev. Ser. 2 106, 620 (1957).
34. R. A. Rescorla, A. R. Wagner, in Classical Conditioning II, A. H. Black, W. F. Prokasy, Eds. (Appleton-Century-Crofts, New York, 1972), pp. 64–99.
35. A. J. Yu, P. Dayan, Uncertainty, neuromodulation, and attention. Neuron 46, 681–692 (2005). doi:10.1016/j.neuron.2005.04.026
36. K. Doya, K. Samejima, K. Katagiri, M. Kawato, Multiple model-based reinforcement learning. Neural Comput. 14, 1347–1369 (2002). doi:10.1162/089976602753712972

Acknowledgments: We thank Chris Summerfield and Stefano Palminteri for their helpful comments. Funded by a European Research Council Grant (ERC-2009-AdG#250106) to E.K. MRI data are available at central.xnat.org, project ID: PROBE.

SUPPLEMENTARY MATERIALS
www.sciencemag.org/cgi/content/full/science.1252254/DC1
Materials and Methods
Supplementary Text
Figs. S1 to S5
Tables S1 and S2
References (33–36)

14 February 2014; accepted 15 May 2014
Published online 29 May 2014
10.1126/science.1252254

Fig. 1. A model of human reasoning. Solid squares, behavioral strategies stored in long-term memory. λi, λj, λk, λp denote absolute reliabilities of monitored strategies inferred from action outcomes (here, the inferential capacity is three). Purple, actor strategy learning external contingencies and selecting actions maximizing rewards. In exploitation periods, the actor is reliable (i.e., λactor > 1 – λactor or λactor > 1/2), the others being necessarily unreliable (because ∑λ ≤ 1). Otherwise, the system switches into exploration (all λ < 1/2) and creates a probe actor (p) from mixing strategies stored in long-term memory (blue). Exploration periods terminate, when either one counterfactual strategy (j) or probe actor (p) becomes reliable: the probe actor is then rejected (red) or confirmed (orange). See text for details.

Fig. 2. Behavioral performances following episode changes. Proportion of correct (top) and exploratory (middle) responses from episode onsets (arrows).
(Left) Participants’ performances. (Right) Fitted model predictions in every trial given participants’ responses in previous trials. Bottom, statistical dependences between two successive correct responses produced by participants (left) and fitted model simulations (right) (mutual information computed over five-trial sliding windows). Green, open episodes; blue, recurrent episodes. Error bars are SEM across participants. See table S1 for model parameters. *P < 0.05.

Fig. 3. Behavioral performances according to predicted algorithmic transitions. Model predictions (bottom) and participants’ performances (top) realigned on switch-in (left), rejection (right, red) and confirmation (right, orange) events occurring in the algorithm. Data points following switch-in events and preceding rejection and confirmation events included only exploration trials. Model predictions are computed in every trial given participants’ responses in previous trials. Green, open episodes. Blue, recurrent episodes. Green and blue shaded areas are centered on the average of episode onsets preceding switch-in events (width: standard deviations). Perseverative responses correspond to correct responses in preceding episodes. Error bars are SEM across participants.

Fig. 4. Brain activations associated with reliability inferences. (Bottom) 3-D rendering of all brain activations correlating with actor reliability (magenta) and with first and second alternative reliability (cyan) (thresholded at P < 0.005 (voxel-wise, uncorrected) and P < 0.05 (cluster-wise) for display purposes). MNI coordinates of activation peaks are shown in brackets. (Bar graphs) Partial correlation coefficients for feedback valence, actor, first- and second-alternative reliability averaged over activation clusters.
White bars are for the symmetrical region of right FPC activations (left FPC). Error bars are SEM across participants. *P < 0.05.

Fig. 5. Prefrontal and basal responses to predicted algorithmic transitions. Brain slices, activations in switch-in (blue), rejection (red) and confirmation (orange) events superimposed on anatomical templates (thresholded at P < 0.005 (voxel-wise, uncorrected) and P < 0.05 (cluster-wise) for display purposes). X, Y, Z are slice MNI coordinates corresponding to activation peaks (table S2). Graphs, peri-feedback magnetic resonance responses to switch-in, rejection and confirmation events averaged over activation clusters and factoring out all other effects. Black lines are peri-feedback MR responses in exploitation (square) and exploration (diamond) trials. Error bars are SEM across participants.

Fig. 6. Prefrontal and striatal responses around algorithmic transitions. Magnetic resonance responses to feedback in dACC, midLPC and ventral striatum on trials preceding and following switch-in, rejection and confirmation events. Bars are partial correlation coefficients (betas) from the regression analysis described in the text and corresponding to event-related regressors modeling switch-in, rejection and confirmation events shifted 0, 1 or 2 trials preceding and following actual occurrences of these events. Error bars are SEM across subjects. Maximal and significant responses (when corrected for multiple comparisons around algorithmic events) were elicited only when the events occurred in the algorithm.
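
As a reading aid for the Fig. 6 caption, the trial-shifted event regressors it mentions can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `shifted_event_regressors`, its parameters, and the example trial numbers are hypothetical, and the sketch builds only trial-level indicator columns, leaving out the hemodynamic modeling and the regression itself.

```python
def shifted_event_regressors(event_trials, n_trials, max_shift=2):
    """Build one 0/1 indicator per shift in -max_shift..+max_shift.

    Hypothetical helper for illustration only. The indicator for shift s
    is 1 on trial t + s for every event trial t, so s < 0 marks trials
    preceding the event, s = 0 the event trial itself, and s > 0 trials
    following it. Shifted trials outside the run are dropped.
    """
    regressors = {}
    for shift in range(-max_shift, max_shift + 1):
        indicator = [0] * n_trials
        for t in event_trials:
            shifted = t + shift
            if 0 <= shifted < n_trials:
                indicator[shifted] = 1
        regressors[shift] = indicator
    return regressors


# Assumed example: switch-in events on trials 3 and 8 of a 10-trial run.
regs = shifted_event_regressors([3, 8], n_trials=10)
for shift in sorted(regs):
    print(shift, regs[shift])
```

With one such set of columns per event type (switch-in, rejection, confirmation), a response that is specific to the event trial would load mainly on the shift-0 column, which is the pattern the caption reports.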