Overcoming Temptation: Incentive Design For Intertemporal Choice

Michael C. Mozer (Michael.Mozer@colorado.edu)
Shruthi Sukumar (Shruthi.Sukumar@colorado.edu)
Camden Elliott-Williams (Camden.ElliottWilliams@colorado.edu)
Department of Computer Science, University of Colorado, Boulder, CO 80309-0430, USA

Shabnam Hakimi (Shabnam.Hakimi@colorado.edu)
Institute of Cognitive Science, University of Colorado, Boulder, CO 80309-0345, USA

Adrian F. Ward (Adrian.Ward@mccombs.utexas.edu)
McCombs School of Business, University of Texas, Austin, TX 78705, USA

Abstract

Individuals are often faced with temptations that can lead them astray from long-term goals. We're interested in developing interventions that steer individuals toward making good initial decisions and then maintaining those decisions over time. In the realm of financial decision making, a particularly successful approach is the prize-linked savings account: individuals are incentivized to make deposits by tying deposits to a periodic lottery that awards bonuses to the savers. Although these lotteries have been very effective in motivating savers across the globe, they are a one-size-fits-all solution. We investigate whether customized bonuses can be more effective. We formalize a delayed-gratification task as a Markov decision problem and characterize individuals as rational agents subject to temporal discounting, costs associated with effort, and moment-to-moment fluctuations in willpower. Our theory is able to explain key behavioral findings in intertemporal choice. We created an online delayed-gratification game in which the player scores points by choosing a queue to wait in and patiently advancing to the front. Data collected from the game is fit to the model, and the instantiated model is then used to optimize predicted player performance over a space of incentives. We demonstrate that customized incentive structures can improve goal-directed decision making.

Should you go hiking today or work on that manuscript? Should you have a slice of cake or stick to your diet? Should you upgrade your flat-screen TV or contribute to your retirement account? Individuals are regularly faced with temptations that lead them astray from long-term goals. These temptations all reflect an underlying challenge in behavioral control that involves choosing between actions leading to small but immediate rewards and actions leading to large but delayed rewards. We introduce a formal model of this delayed gratification decision task, extending the Markov decision framework to incorporate the psychological notion of willpower, and using formal models to optimize behavior by designing incentives to assist individuals in achieving long-term goals.

Consider the serious predicament with retirement planning in the United States. Only 55% of working-age households have retirement account assets—whether an employer-sponsored plan or an IRA—and the median account balance for near-retirement households is $14,500. Even considering households' net worth, 2/3 fall short of conservative savings targets based on age and income (Rhee and Boivie, 2015). Furthermore, 40% of every dollar contributed to the accounts of savers under age 55 simultaneously flows out of the retirement systems, not counting loans to oneself (Argento et al., 2015).
In 2013, the US government and nonprofits spent $670M on financial education, yet financial literacy accounts for a minuscule 0.1% of the variance in financial outcomes (Fernandes et al., 2014).

One technique that has been extremely successful in encouraging savings, primarily in Europe and the developing world but more recently in the US as well, is the prize-linked savings account (PLSA) (Kearney et al., 2010). The idea is to pool a fraction of the interest from all depositors to fund a prize awarded by periodic lotteries. Just as ordinary lotteries entice individuals to purchase tickets, the PLSA encourages individuals to save. Disregarding the fact that lotteries function in part because individuals overvalue low-probability gains (Kahneman and Tversky, 1979), the core of the approach is to offer savers the prospect of short-term payoffs in exchange for them committing to the long term. Although the account yields a lower interest rate to fund the lottery, the PLSA increases the net expected account balance due to greater commitment to participation.

The PLSA is a one-size-fits-all solution. A set of incentives that works well for one individual or one subpopulation may not be optimal for another. In this article, we investigate approaches to customizing incentives to an individual or a subpopulation with the aim of achieving greater adherence to long-term goals and, ultimately, better long-term outcomes for the participants. Our approach involves: (1) building a model to characterize the behavior of an individual or group, (2) fitting the model with behavioral data, (3) using the model to determine an incentive structure that optimizes outcomes, and (4) validating the model by showing better outcomes with model-derived incentives than with alternative incentive structures.

1. Intertemporal Choice

Intertemporal choice involves decisions that produce gains and losses at different points in time. How an individual interprets delayed consequences influences the utility or value associated with a decision. When consequences are discounted with the passage of time, decision making is biased toward more immediate gains (and more distant losses). The delay discounting task is often used to study intertemporal choice (Green and Myerson, 2004). Individuals are asked to choose between two alternatives, e.g., $1 today versus $X in Y days. By identifying the X that yields subjective indifference for a given Y, one can estimate an individual's discounting of future outcomes. Discount rates vary across individuals yet show stability over extended periods of time (Kirby, 2009). This paradigm involves a single, hypothetical decision and reveals the intrinsic future value of an outcome.

However, it does not address the temporal dynamics of behavior during a delay period. Once an initial decision is made to wait for a large reward, some scenarios permit an individual to abandon the decision at any instant in favor of the small immediate reward. For example, in the classic marshmallow test (Mischel and Ebbesen, 1970), children are seated at a table with a single marshmallow. They are allowed to eat the marshmallow, but if they wait while the experimenter steps out of the room, they will be offered a second marshmallow when the experimenter returns. In this delayed gratification task, children must continually contemplate whether to eat the marshmallow or wait for two marshmallows.
Their behavior depends not only on the hypothetical discounting of future rewards but on an individual's willpower—their ability to maintain focus on the larger reward and not succumb to temptation before the experimenter returns. Defection at any moment eliminates the possibility of the larger reward.

The marshmallow test achieved renown not only because it turns out to be predictive of later life outcomes (Mischel et al., 1989), but because it is analogous to many situations involving delayed gratification. Like the marshmallow test, some of these situations have an unspecified time horizon (e.g., exercise, waiting for an elevator, spending during retirement). However, others have a known horizon (e.g., avoiding snacks before dinner, saving for retirement, completing a college degree). Our work addresses the case of a known or assumed horizon.

Beyond whether the horizon is known or not, delayed-gratification tasks may also be characterized in terms of the number of opportunities to obtain the delayed reward. The marshmallow test is one shot, but many true-to-life scenarios have an iterative nature. For example, in retirement planning, the failure to contribute to the account one month does not preclude contributing the next month. Another intuitive example involves allocating time within a work day. One must choose between tasks that are relatively quick and provide a moment of satisfaction (e.g., answering email, cleaning a desk top) and those that are more effortful but also yield a greater sense of accomplishment (e.g., editing a paper for submission to a journal, reading a research article). Our work addresses both one-shot and iterated delayed-gratification tasks. For such tasks, we're interested in developing personalized interventions that assist individuals both in making good initial decisions and in maintaining those decisions over time.

2. Theories of Intertemporal Choice

Nearly all previous conceptualizations of intertemporal choice have focused on the shape of the discounting function and the initial 'now versus later' decision, not the time course. One exception is the work of McGuire and Kable (2013), who frame failure to postpone gratification as a rational, utility-maximizing strategy when the time at which future outcomes materialize is uncertain. Our theory is complementary in providing a rational account in the known-time-horizon situation. There is a rich literature treating human decision making within the framework of Markov decision processes (MDPs; e.g., Shen et al., 2014; Niv et al., 2012), but this research does not directly address intertemporal choice. Kurth-Nelson and Redish (2010, 2012) have explored a reinforcement learning framework to model precommitment in decision making as a means of preventing impulsive defections. This interesting work focuses on the initial decision whether to precommit rather than the ongoing possibility of defection. To the best of our knowledge, we are the first to adopt an MDP perspective on intertemporal choice, a field which has relied primarily on verbal, qualitative accounts.
Figure 1: Finite-state environment formalizing (a) the one-shot delayed-gratification task; (b) the iterated delayed-gratification task; (c) the iterated delayed-gratification task with variable delays and LL outcomes; and (d) an efficient approximation to the iterated delayed-gratification task, suitable when episodes are independent of one another.

One challenge to modeling behavior with MDPs, whether in the framework of reinforcement learning or dynamic programming, is that it is mathematically convenient to assume exponential discounting, whereas studies of human intertemporal choice support hyperbolic discounting (Frederick et al., 2002). Kurth-Nelson and Redish (2010) have proposed a solution to this issue by exploiting the fact that a hyperbolic function can be well approximated by a mixture of exponentials. In the work we present, we assume exponential discounting, but our work could readily be extended in the same manner as Kurth-Nelson and Redish (2010).

3. Formalizing Delayed-Gratification Tasks as a Markov Decision Problem

In this section, we formalize a delayed-gratification task as a Markov decision problem, which we will refer to as the DGMDP. We assume time to be quantized into discrete steps and we focus on situations with a known or assumed time horizon, denoted τ. At any step, the agent may defect and collect a small reward, or the agent may persist to the next step, eventually collecting a large reward at step τ. We use µ_SS and µ_LL to denote the smaller-sooner (SS) and larger-later (LL) rewards. Figure 1a shows a finite-state representation of the one-shot task with terminal states LL and SS that correspond to resisting and succumbing to temptation, respectively, and states for each time step between the initial and final times, t ∈ {1, 2, ..., τ}. Rewards are associated with state transitions. The possibility of obtaining intermediate rewards during the delay period is annotated via µ_{1:τ-1} ≡ {µ_1, ..., µ_{τ-1}}, which we return to later. With exponential discounting, rewards n steps ahead are devalued by a factor of γ^n, 0 ≤ γ < 1.

Given the DGMDP, an optimal decision sequence is trivially obtained by value iteration. However, this sequence is a poor characterization of human behavior. With no intermediate rewards (µ_{1:τ-1} = 0), it takes one of two forms: either the agent defects at t = 1 or the agent persists through t = τ. In contrast, individuals will often persist some time and then defect, and when placed into the same situation repeatedly, behavior is nondeterministic. For example, replicability on the marshmallow test is quite modest (ρ < 0.30: Mischel et al., 1988). The discrepancy between human delayed-gratification behavior and the optimal decision-making framework might indicate an incompatibility. However, we prefer a bounded rationality perspective on human cognition according to which behavior is cast as optimal but subject to cognitive constraints. We claim two specific constraints.

1. Individuals exhibit moment-to-moment fluctuations in willpower based on factors such as sleep, hunger, mood, etc. Low willpower causes an immediate reward to seem more tempting, and high willpower, less tempting. We characterize willpower as a one-dimensional Gaussian process, W = {W_t}, with w_1 ∼ Gaussian(0, σ_1^2) and w_t ∼ Gaussian(w_{t-1}, σ^2).
We suppose that willpower modulates an individual's subjective value of defecting at step t:

    Q(t, w; defect) = µ_SS − w,    (1)

where Q(s; a) denotes the value associated with performing action a in state s, and the state space consists of the discrete step t and the continuous willpower level w.

2. Behavioral, economic, and neural accounts of decision making suggest that effort carries a cost, and that rewards are weighed against the effort required to obtain them (e.g., Kivetz, 2003). This notion is incorporated into the model via an effort cost µ_E associated with persevering:

    Q(t, w; persist) = µ_E + µ_t + γ E_{W_{t+1} | W_t = w}[V(t+1, W_{t+1})]   for t < τ
                     = µ_LL                                                   for t = τ,    (2)

where

    V(t, w) ≡ max_a Q(t, w; a).    (3)

With these two constraints, we will show that the model not only has adequate expressive power to fit behavioral data, but also has the explanatory power to predict experimental outcomes.

The one-shot DGMDP in Figure 1a can be extended to model the iterated task (Figure 1b), even when there is variability in the reward (µ_LL) or duration (τ) across episodes (Figure 1c).[1] Finally, it is straightforward to show that the solution to the iterated DGMDPs in Figures 1b or 1c is identical to the solution to the simpler and more tractable one-shot DGMDP in Figure 1d under certain constraints (see Supplementary Materials). Essentially, Figure 1d models the choice between the LL reward or a sequence of SS rewards matched in total number of steps, effectively comparing the reward rates for LL and SS, the critical variables in the iterated DGMDP.

[1] Figures 1b,c describe an indefinite series of episodes. If the total number of episodes or steps is constrained, as in any realistic scenario (e.g., an individual has eight hours in the work day to perform tasks like answering email), then the state must be augmented with a representation of remaining time. We dodge this complication by modeling situations in which the 'end game' is not approaching, e.g., only the first half of a work day.

To summarize, we have formalized one-shot and iterated delayed-gratification tasks with known horizon as a Markov decision problem with parameters Θ_task ≡ {τ, µ_SS, µ_LL, µ_{1:τ-1}}, and a constrained rational agent parameterized by Θ_agent ≡ {γ, σ_1, σ, µ_E}. We now turn to solving the DGMDP and characterizing its properties.

3.1 Solving The Delayed-Gratification Markov Decision Problem (DGMDP)

The simple structure of the environment allows for a backward-induction solution to the Bellman equation (Equation 2). Although the continuous willpower variable precludes an analytical solution for V(t, w), we construct a piecewise linear approximation (PLA) over w for each step t. To justify the PLA, consider the shape of V(t, w). With high willpower (w → ∞), the agent almost certainly persists to the LL reward and the function asymptotes at the discounted µ_LL. With low willpower (w → −∞), the agent almost certainly defects and the function approaches µ_SS − w. Thus, both extrema of the value function are linear with known slope and intercept. At step τ, these two linear segments exactly define the value function. At t < τ, there is an intermediate range within which small fluctuations in willpower can influence the decision, and the expectation in Equation 2 yields a weighted mixture of the two extrema, which is well fit by a single linear segment—defined by its slope a_t and intercept b_t.
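For intuition, the backward induction can also be carried out by brute force on a dense willpower grid, in place of the piecewise linear approximation. The following minimal sketch does exactly that; the parameter values are illustrative (not the fitted values reported later), and the grid-based evaluation is a stand-in for, not a description of, the procedure used here.

```python
import numpy as np

# Illustrative parameters (not the values fit to behavioral data later on).
tau, mu_SS, mu_LL = 8, 1.0, 2.0          # horizon, smaller-sooner and larger-later rewards
mu_int = np.zeros(tau)                   # intermediate rewards; mu_int[t] plays the role of mu_t (all zero here)
gamma, sigma, mu_E = 0.92, 0.25, 0.0     # discount factor, willpower step s.d., effort cost

# Dense willpower grid: a stand-in for the piecewise linear approximation.
w = np.linspace(-4.0, 4.0, 801)

def expected_next_value(V_next):
    """E[V(t+1, W_{t+1}) | W_t = w] under the Gaussian random-walk willpower, on the grid."""
    K = np.exp(-0.5 * ((w[None, :] - w[:, None]) / sigma) ** 2)  # Gaussian transition kernel
    K /= K.sum(axis=1, keepdims=True)                            # normalize each row
    return K @ V_next

# Backward induction over Equations 1-3, from the final step tau down to step 1.
V = np.empty((tau + 1, w.size))
V[tau] = np.maximum(mu_SS - w, mu_LL)    # at t = tau: defect (mu_SS - w) vs. collect the LL reward
for t in range(tau - 1, 0, -1):
    q_defect = mu_SS - w
    q_persist = mu_E + mu_int[t] + gamma * expected_next_value(V[t + 1])
    V[t] = np.maximum(q_defect, q_persist)
```

The defection threshold at step t is the willpower value at which µ_SS − w crosses the persist value; those thresholds feed the hazard computation described below.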
With V(t, w) expressed as a PLA, the expectation in Equation 2 becomes:

    E_{W_t | W_{t-1} = w}[V(t, W_t)] = Φ(z_t^-)(µ_SS − w) + [Φ(z_t^+) − Φ(z_t^-)](b_t + a_t w) + [1 − Φ(z_t^+)] c_t
                                       + σ φ(z_t^-) + σ a_t [φ(z_t^-) − φ(z_t^+)],    (4)

where Φ(.) and φ(.) are the cdf and pdf of a standard normal distribution, respectively, and the standardized segment boundaries are z_t^- = σ^{-1}[(µ_SS − b_t)/(a_t + 1) − w] and z_t^+ = σ^{-1}[(c_t − b_t)/a_t − w]. The backup is seeded with z_τ^- = z_τ^+ = σ^{-1}(µ_SS − µ_LL − w) and a_τ = b_τ = c_τ = µ_LL. After each backup step, a Levenberg-Marquardt nonlinear least squares fit obtains a_{t-1} and b_{t-1}; c_{t-1}—the value of steadfast persistence—is obtained by propagating the discounted reward for persistence: c_{t-1} = µ_E + µ_t + γ c_t.

Figure 2a shows the value as a function of willpower at each step of an eight-step DGMDP with an LL reward twice that of the SS reward, like the canonical marshmallow test. Both the exact value-function formulation (Equation 4) and the corresponding PLA are presented in colored and black lines, respectively. To ensure accuracy of the estimate and to eliminate an accumulation of estimation errors, we have also used a fine piecewise constant approximation in the intermediate region, yet the model output is almost identical.

Using the value function, we can characterize the agent's behavior in the DGMDP via the likelihood of defecting at various steps. With D denoting the defection step, we have the hazard probability

    h_t ≡ P(D = t | D ≥ t) ≡ P(W_t < w_t^* | W_1 ≥ w_1^*, ..., W_{t-1} ≥ w_{t-1}^*),    (5)

where w^* is the willpower threshold that yields action indifference, Q(t, w^*; defect) = Q(t, w^*; persist).

Figure 2: (a) Value function for a DGMDP with τ = 8, σ = .25, σ_1 = .50, γ = .92, µ_E = µ_t = 0, µ_LL = 2, µ_SS = 1, exact (colored curves) and piecewise linear approximation (black lines). (b) Hazard functions for the parameterization in (a) (solid blue curve), with a higher level of LL reward (red curve), and with a shorter delay period, τ = 6 (dashed blue curve).

To represent the posterior distribution over willpower at each nondefection step, we initially used a particle filter but found a computationally more efficient and stable solution with quantile-based samples. We approximate the W_1 prior and ∆W with discrete, equal-probability q-quantiles. We reject values for which defection occurs, and then propagate W_{t+1} = W_t + ∆W, which results in up to q^2 samples, which we thin back to q-quantiles at each step. Using q = 1000 produces nearly identical results to selecting a much higher density of samples.

The solid blue curve in Figure 2b shows the hazard function for the DGMDP in Figure 2a. Defection rates drop as the agent approaches the goal. Defection rates also scale with the LL reward, as illustrated by the contrast between the solid blue and red curves. Finally, defection rates depend both on relative and absolute steps to goal: contrasting the solid and dashed blue curves, corresponding to τ = 8 and τ = 6, respectively, the defection rate at a given number of steps from the goal depends on τ. We will shortly show that human behavioral data exhibit this same qualitative property.
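The hazard probabilities of Equation 5 can also be estimated by straightforward simulation. Below is a minimal sketch, using the σ_1 and σ of Figure 2a and a made-up threshold sequence w_star standing in for the thresholds w_t^* obtained from the value function:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma1, sigma = 8, 0.50, 0.25           # as in the Figure 2a parameterization
# Hypothetical defection thresholds w*_t; in practice these are obtained from the
# value function by solving Q(t, w*; defect) = Q(t, w*; persist) at each step.
w_star = np.linspace(0.3, -1.0, tau)

n = 200_000
W = rng.normal(0.0, sigma1, n)               # W_1 ~ Gaussian(0, sigma1^2)
alive = np.ones(n, dtype=bool)
hazard = []
for t in range(tau):
    defect = alive & (W < w_star[t])         # defect whenever willpower dips below w*_t
    hazard.append(defect.sum() / max(alive.sum(), 1))
    alive &= ~defect
    W = W + rng.normal(0.0, sigma, n)        # correlated (random-walk) willpower
print(np.round(hazard, 3))
```

Because trajectories that survive to later steps are selected for high willpower, the simulated hazard declines over steps; drawing W_t independently at each step instead of as a random walk removes this decline.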
Interestingly, the correlation in willpower from one step to the next is critical in obtaining this property. When willpower is independent from step to step, i.e., w_t ∼ Gaussian(0, σ^2), defection rates depend only on absolute steps to goal. Thus, moment-to-moment correlation in willpower is essential for modeling human behavior.

3.2 Behavioral Phenomena Explained

We consider the solution of the DGMDP as a rational theory of human cognition. It is meant to explain both an individual's initial choice ("Should I open a retirement account?") as well as the temporal dynamics of sustaining that choice ("Should I withdraw the funds to buy a car?").

Our theory explains two key phenomena in the literature. First, failure on a DG task is sensitive to the relative magnitudes of the SS and LL rewards (Mischel, 1974). Figure 2b presents hazard functions for two reward magnitudes. The probability of obtaining the LL reward is greater with µ_LL/µ_SS = 3 than with µ_LL/µ_SS = 2. The theory can also accommodate the finding that environmental reliability and trust in the experimenter affect outcomes in the marshmallow test (Kidd et al., 2012): in unreliable or nonstationary environments, the expected LL reward is lower than the advertised reward, and the DGMDP is based on reward expectations.

Second, a reanalysis of data from a population of children performing the marshmallow task shows a declining hazard rate over the task period of 7 minutes (McGuire and Kable, 2013). The rapid initial drop in the empirical curve looks remarkably like the curves in Figure 2b. One might interpret this phenomenon as a finish-line effect: the closer one gets to a goal, the greater is the commitment to achieve the goal. However, the model suggests that this behavior arises not from abstract psychological constructs but because of correlations in willpower over time: if an individual starts down the path to an LL reward, the individual's willpower at that point must be high. The posterior willpower distributions reflect the elimination of individuals with low momentary willpower, which contributes to the declining hazard rate. Also contributing is the exponential increase in value of the discounted LL reward as the agent advances through the DGMDP. McGuire and Kable (2013) explain the empirical hazard function via a combination of uncertainty in the time horizon and time-fluctuating discount rates. Our theory shows that these strong assumptions are not necessary, and our theory can address situations with a well-delineated horizon such as retirement saving. Additionally, our theory aims to move beyond population data and explain the granular dynamical behavior of an individual.

4. Optimizing Incentives

With a computational theory of the DG task in hand, we now explore a mechanism-design approach (Nisan and Ronen, 1999) aimed at steering individuals toward improved long-term outcomes. We ask whether we can provide incentives to rational value-maximizing agents that will increase their expected reward subject to constraints on the incentives. We focus on an investment scenario roughly analogous to a prize-linked savings account (PLSA). Suppose an individual has x dollars which they can deposit into a bank account earning interest at rate r, compounded annually. At the start of each year, they decide whether to continue saving (persist) or to withdraw and spend their entire savings with interest accumulated thus far (defect).[2] Our goal is to assist them in maximizing the profit they reap over τ − 1 years from their initial investment. Our incentive mechanism is a schedule of lotteries.
We refer to expected lottery distributions as bonuses, even though they are funded through the interest earned by a population of individuals, like the prizes of the PLSA. With µ_t denoting the bonus awarded in year t and µ_{1:τ-1} denoting the set of scheduled bonuses, our goal as mechanism designers is to identify the schedule that maximizes the expected net accumulation from an individual's investment:

    µ*_{1:τ-1} = argmax_{µ_{1:τ-1}} Σ_{t=1}^{τ} P(D = t | µ_{1:τ-1}) [ b_t + Σ_{t'=1}^{t-1} µ_{t'} ],    (6)

where b_t is the amount banked at the start of year t, with b_1 = x and b_{t+1} = (1 + r)(b_t − µ_t), and D is the year of defection, where D = 1 represents immediate defection and D = τ represents the account reaching maturity. Defection probabilities are obtained from the theory (Equation 5).

[2] Although this all-or-none withdrawal of savings is not entirely realistic, it reduces the decision space to correspond with the FSM in Figure 1a. Were we to allow intermediate levels of withdrawal, the simulation would yield intermediate benefits of incentives.

Figure 3: Bonus optimization for an agent with σ_1 = 50, σ = 30, µ_E = 0, and γ ∈ [0.55, 0.95]. (a) Expected payoff for the one-shot DGMDP for various bonus scenarios, including no bonus and optimal bonuses with lottery odds 1:0, 1:10, 1:100, and 1:1000. In these simulations, the interest-accrual scheme is used to constrain bonuses and payoffs. (b) Optimal bonus amounts at each step for various γ and lottery 1:0 (certain win), on the scale of an x = 100 initial pool of funds. (c) Expected payoff per time step for the iterated DGMDP for various bonus scenarios. In these simulations, the bonus-limits scheme is used to constrain bonuses and payoffs.

To illustrate this approach, we conducted a simulation with γ ∈ [0.55, 0.95], τ = 10, r = 0.1, and x = 100, comparing an agent's expected accumulation without bonuses and with optimal bonuses. Optimization is via direct search using the simplex algorithm over unconstrained variables p_t ≡ logit(µ_t/b_t), representing the proportion of the bank being distributed as a bonus. We first consider the case of deterministic bonuses: the agent receives bonus µ_t in year t with certainty. Figure 3a shows the expected payoff as a function of an agent's discount factor γ for the scenario with no bonuses (purple curve) versus optimal bonuses awarded with probability 1.0 (light blue curve, labeled with the odds of a bonus being awarded, '1:0'). For reference, the asymptotic SS and LL payoffs are shown with dotted and dashed lines, respectively. With high discounting, this simulation yields a modest (∼10%) improvement in an individual's expected accumulation by providing bonuses at the end of the early years and going into the final year (Figure 3b). Bonuses are recommended only when the gain from encouraging persistence beats the loss of interest on an awarded bonus. With low discounting, the model optimization recommends no bonuses. Thus, the simulation recommends different incentives to individuals depending on their discount factors.
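To make the optimization of Equation 6 concrete, the sketch below runs the same kind of simplex (Nelder-Mead) search over unconstrained logits p_t = logit(µ_t/b_t). The function defection_probs is only a placeholder for the model-derived probabilities of Equation 5 (a toy hazard model is used so the code runs on its own); everything else follows the interest-accrual scheme described above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid, the inverse of logit

tau, r, x = 10, 0.10, 100.0      # horizon in years, interest rate, initial deposit

def defection_probs(bonuses):
    """Stand-in for the model-derived P(D = t | mu_{1:tau-1}) of Equation 5.
    This toy version assumes each bonus point slightly lowers the yearly
    defection hazard; substitute the fitted DGMDP here."""
    h = np.clip(0.15 - 0.001 * bonuses, 0.01, 0.99)          # hazard in years 1..tau-1
    survive = np.concatenate(([1.0], np.cumprod(1.0 - h)))   # P(no defection before year t)
    return np.append(survive[:-1] * h, survive[-1])          # defect in year t, or reach maturity

def neg_expected_payoff(p_logit):
    mu = np.zeros(tau)                                # mu[t] = bonus paid in year t
    bank = np.zeros(tau + 1)
    bank[1] = x                                       # b_1 = x
    for t in range(1, tau):
        mu[t] = bank[t] * expit(p_logit[t - 1])       # mu_t = b_t * sigmoid(p_t)
        bank[t + 1] = (1 + r) * (bank[t] - mu[t])     # b_{t+1} = (1 + r)(b_t - mu_t)
    P = defection_probs(mu[1:])
    payoff = sum(P[t - 1] * (bank[t] + mu[1:t].sum()) for t in range(1, tau + 1))
    return -payoff                                    # minimize the negative of Equation 6

result = minimize(neg_expected_payoff, np.full(tau - 1, -3.0), method="Nelder-Mead")
optimal_fractions = expit(result.x)   # fraction of the bank paid out as a bonus each year
```

In practice, the placeholder would be replaced by the defection distribution computed from the fitted DGMDP for a given γ, σ_1, σ, and µ_E.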
Now consider a lottery such as that conducted for the PLSA. If individuals operate based on expected returns, an uncertain lottery with odds 1:α and payoff (α + 1)µ_t would be equivalent to a certain payoff of µ_t. However, as characterized by prospect theory (Kahneman and Tversky, 1979), individuals overweight low-probability events. Using median parameter estimates from cumulative prospect theory (Tversky and Kahneman, 1992) to infer subjective probabilities on lotteries with 1:10, 1:100, and 1:1000 odds, we optimize bonuses for these cases.[3] As depicted by the three upper curves in Figure 3a, lotteries such as the PLSA can significantly boost the benefit of incentive optimization.

[3] According to prospect theory, the 1:10, 1:100, and 1:1000 lotteries yield overweighting by factors of 1.86, 5.50, and 14.40, respectively.

Lotteries and interest accrual are not suitable for all delayed-gratification tasks. For instance, one would not wish to encourage a dieter by offering a lottery for a 50-gallon tub of ice cream or the promise of a massive all-you-can-eat dessert buffet at the conclusion of the diet. To demonstrate the flexibility of our framework, we posit a bonus-limit scheme as an alternative to the interest-accrual scheme, in which up to n_b bonuses of fixed size can be awarded and the optimization determines the time steps at which they are awarded. We conducted a simulation with the iterated DGMDP (Figure 1d) using γ ∈ [0.55, 0.95], τ = 10, awarding of n_b ≤ 4 bonuses each of value 50, µ_SS = 100, and µ_LL = 150τ − 50n_b. Multiple bonuses could be awarded in the same step, but bonuses were limited such that no defection could achieve a reward rate greater than µ_SS. This setup anticipates human experiments that we report later in the article. Figure 3c shows expected payoff per step, ranging from 100 for the SS reward to 150 for the LL reward, for the no-bonus condition (purple curve) and conditions with lotteries having odds 1:0, 1:10, 1:100, and 1:1000. As with the alternative DGMDP formulation with a single-shot task and the interest-based framework, optimization of bonuses achieves benefits which depend on γ and lottery odds.

5. Experiments

We have argued that our modeling framework is flexible enough to describe a variety of delayed-gratification tasks, both one-shot and iterated, with variable payoff and incentive structures. This framework provides a potential explanation of human cognition, under the conjecture that individuals can be cast as boundedly rational agents who seek to maximize their payoffs given cognitive constraints such as discounting and fluctuations in willpower. If this conjecture is supported, the framework should allow us to determine incentives that will shape behavioral outcomes. Typically, support for a model is obtained by comparing it to alternatives and arguing that one model is better on grounds of parsimony or predictive power. With no existing models suited to explaining the moment-to-moment dynamics of behavior, our strategy instead is to show first that the model is consistent with behavior by fitting model parameters to behavioral data, and second, that the fitted, fully constrained model can make strong predictions concerning the outcomes of subsequent experiments.

To collect behavioral data, we created a simple online delayed-gratification game in which players score points by waiting in a queue, much as diners score delicious foods by waiting their turn at a restaurant buffet (Figure 4a). The upper queue is short, having only one
position, and delivers a 100-point reward when the player is serviced. The lower queue is long, having τ positions, and delivers a 100τρ-point reward when the player is serviced. The reward-rate ratio, ρ, is either 1.25 or 1.50 in our experiments. The player starts in a vestibule (right side of screen) and selects a queue with the up and down arrow keys. The game updates at 2000 msec intervals, at which point the player's request is processed and the queues advance (from right to left). Upon entering the short queue, the player is immediately serviced. Upon entering the long queue, the player immediately advances to the next-to-last position as the queue shuffles forward. With every tick of the game clock, the player may hit the left-arrow key to advance in the long queue or the up-arrow key to defect to the short queue. If the player takes no action, the simulated participants behind the player jump past. When the player defects to the short queue, the player is immediately serviced. When points are awarded, the screen flashes the points and a cash register sound is played, and the player returns to the vestibule and a new episode begins.

Figure 4: The queue-waiting game. (a) The player (red icon) is in the vestibule, prior to choosing a queue. Queues advance right to left. Points awarded per queue are displayed left of the queue. (b) A snapshot of the game taken while the queues advance. As described in the text, this condition includes bonuses at certain positions in the long queue. Point increments are flashed as they are awarded.

In our initial experiments, the long-queue length τ is uniformly drawn from {4, 6, 8, 10, 12, 14} for each episode. Note that the reward rate (points per action) for either queue does not depend on the long-queue length. Because of this constraint, each episode is functionally decoupled from following episodes. That is, the optimal action for the current episode will not depend on upcoming episodes.[4] Due to this fact and the time-constrained nature of the game, the iterated DGMDP in Figure 1d is appropriate for describing a rational player's understanding of the game. This DGMDP focuses on reward rate and treats a defection as if the player continues to defect until τ steps are reached, each step delivering the small reward. In contrast to Figure 1c, Figure 1d is not concerned with the interdependence of episodes. The vestibule in Figure 4a corresponds to state 1 in Figure 1d and the lower-queue position closest to the service desk to state τ. Note the left-to-right reversal of the two figures, which has often confused the authors of this article.

[4] A dependence does occur in the final seconds of the game, where the player may not have sufficient time to complete the long queue. We handle this case by discarding data toward the end of the game.

Participants were recruited to play the game for five minutes via Amazon Mechanical Turk. In our analyses of player behavior, we remove the first and last thirty seconds of play. At the start, players are learning the game actions; at the end, players may not have sufficient time to traverse the long queue and defection is the optimal strategy.
They are required to perform at least one action every ten seconds or the experiment terminates and their data are rejected. 5.1 Experiment 1: Varying Reward Magnitude In Experiment 1, we manipulated the reward-rate ratio. Twenty different participants were tested for each ρ ∈ {1.25, 1.50}. Figure 5a shows the reward accumulation by individual participants in the two conditions as a function of time within the session. The two dashed black lines represent the reward that would be obtained by deterministically performing the SS or LL action at each tick of the game clock. (Participants are not required to act every tick, but they are warned after 7 sec and rejected after 14 sec if they fail to act.) The traces show that some participants had a strong preference for the short queue, others had a nearly perfect preference for the long queue, and still others alternated between strategies. The variability in strategy over time within an individual suggests that they did not simply lock into a fixed, deterministic action sequence. For each participant, each queue length, and each of the τ positions in a queue, we compute the fraction of episodes in which the participant defects at the given position. We average these proportions across participants and then compute empirical hazard curves. Figure 5b shows hazard curves for each of the six queue lengths and the two ρ conditions. The ρ = 1.50 curves are lighter and are offset slightly to the left relative to the ρ = 1.25 curves to make the pair more discriminable. The Figure presents both human data—asterisks connected by dotted lines—and simulation results—circles connected by solid lines. Focusing on the human data for the moment, initial-defection rates rise slightly with queue length and are greater for ρ = 1.25 than for ρ = 1.50. We thus see robust evidence that participants are sensitive to game conditions. To model the data, we set the DGMDP parameters (Θtask ) based on the game configuration. We obtain least-squares fits to the four agent parameters (Θagent ): discount rate γ = 0.957, initial and delta willpower spreads σ1 = 81.3, and σ = 21.3, and effort cost µE = −52.1. The latter three parameters can be interpreted using the scale of the SS reward, µSS = 100 points. Although the model appears to fit the pattern of data quite well, the model has four parameters and the data can essentially be characterized by four qualitative features: the mean rate of initial defection, the modulation of the initial-defection rate based on queue length and on ρ, and the curvature of the hazard function. The model parameters have no direct relationship to these features of the curves, but the model is flexible enough to fit many empirical curves. Consequently, we are cautious in making claims for the model’s validity based solely on the fit to Experiment 1. We note, however, that we investigated a variant of the model in which willpower is uncorrelated across steps, and it produces qualitatively the wrong prediction: it yields curves whose hazard probability depends only on the steps to the LL reward. In contrast, the curves of the correlated-willpower account depend primarily on the distance from the initial state, t, but secondarily on distance to the LL reward, τ − t. 5.2 Experiments 2 and 3: Modulating Effort To obtain additional support for the theory, we modified the queue-waiting game such that the player must work harder and experiences more frustration in reaching the front of the long queue. 
5.2 Experiments 2 and 3: Modulating Effort

To obtain additional support for the theory, we modified the queue-waiting game such that the player must work harder and experience more frustration in reaching the front of the long queue. By increasing the required effort, we may test whether model parameters fit to Experiment 1 will also fit new data, changing only the effort parameter, µ_E. To increase the required effort, the long queue advanced only every other clock tick in an apparently random fashion. Nonetheless, the player must press the advance key every tick to move with the queue, thus requiring exactly two keystrokes for each action in the game FSM (Figure 1d). The game clock in Experiment 2 updated every 1000 msec, twice the rate of Experiment 1, and thus the overall timing was unchanged. We tested only the reward-rate ratio ρ = 1.50.

Figure 5c shows hazard curves for Experiment 2. Using Experiment 1 parameter settings for γ, σ_1, and σ, we fit only the effort parameter, obtaining µ_E = −99.7, which is fortuitously twice the value obtained in Experiment 1. Model fits are superimposed over the human data.

Figure 5: (a) Game points accumulated by individual participants over time in Experiment 1. (b) Hazard curves in Experiment 1 for six line lengths and two reward-rate ratios. Human data shown with asterisks and dashed lines, model fits with circles and solid lines. (c) Hazard curves for Experiment 2, with only one free model parameter. (d) Hazard curves for Experiment 3, with no free model parameters.

To further test the theory's predictive power, we froze all four parameters and ran Experiment 3, in which we introduced a smattering of 50- and 75-point bonuses along the path to the LL reward, adjusting the front-of-queue reward such that the reward-rate ratio ρ = 1.50 was attained when traversing the entire queue (see example in Figure 4b). Using the fully constrained model from Experiment 2, the fit obtained for Experiment 3 was quite good (Figure 5d). The model might underpredict long-queue initial defections, but it captures the curvature of the hazard functions due to the presence of bonuses.

5.3 Experiment 4: Customized Bonuses

In Experiment 4, we tested the effect of bonuses customized to a subpopulation. To situate this experiment, we reviewed the Experiment 2 data to examine inter-participant variability. We stratified the 30 participants in Experiment 2 based on their mean reward rate per action. This measure reflects quality of choices and does not penalize individuals who are slow. With a median split, the weak and strong groups have average reward rates of 103 and 132, respectively.
Theoretically, rates range from 0 (always switching between lines and never advancing) to 100 (deterministically selecting the short queue) to 150 (deterministically selecting the long queue). We fit the hazard curves of each group to a customized γ, leaving unchanged the other parameters previously tuned to the population. We obtained excellent fits to the distinctive hazard functions with γ_strong = 0.999 and γ_weak = 0.875.

We then optimized bonuses for each group for various line lengths. As in Figure 3c, we searched over a bonus space consisting of all arrangements of up to four bonuses, each worth fifty points, allowing multiple bonuses at the same queue position.[5] We subtracted 200 points from the LL reward, maintaining a reward-rate ratio of ρ = 1.50 for completing the long queue. We constrained the search such that no mid-queue defection strategy would lead to ρ > 1. A brute-force optimization yields bonuses early in the queue for the weak group, and bonuses late in the queue for the strong group (Figure 6a).

[5] We avoided the interest-accrual scheme for bonuses because it could lead to variable reward rates among episodes in an iterated DGMDP, which would introduce dependencies that invalidate treating the iterated DGMDP in Figure 1c as equivalent to that in Figure 1d.

Figure 6: Experiment 4. (a) Model-predicted optimal bonus sequences—early (yellow) and late (blue) bonuses for weak and strong participants, respectively. (b) Average reward rate for weak and strong subpopulations and three bonus conditions. Error bars are ±1 SEM, corrected for between-subject variance (Masson and Loftus, 2003).

Experiment 4 tested participants on three line lengths—6, 10, and 14—and three bonus conditions—early, late, and no bonuses. (The no-bonus case was as in Experiment 2.) The 54 participants who completed Experiment 4 were median split into a weak and a strong group based on their reward rate on no-bonus episodes only. Consistent with the model-based optimization, the weak group performs better on early bonuses and the strong group on late bonuses (the yellow and blue bars in Figure 6b). Importantly, there is a 2 × 2 interaction between group and early versus late bonus (F(1, 51) = 11.82, p = .001), indicating a differential effect of bonuses on the two groups. Figure 6b also shows model predictions based on the parameterization determined from Experiment 2. The model has a perfect rank correlation with the data, and correctly predicts that both bonus conditions will facilitate performance, despite the objectively equal reward rate in the bonus and no-bonus conditions. That bonuses should improve performance is nontrivial: the persistence induced by the bonuses must overcome the tendency to defect because the LL reward is lower (as we observed in Experiment 1 with ρ = 1.25 versus ρ = 1.50).
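The following is a sketch of the brute-force search over bonus placements used for Experiment 4. The admissibility test below is one plausible reading of the constraint that no mid-queue defection strategy may exceed the SS reward rate, and expected_reward_rate is only a placeholder; in the actual optimization, each placement is scored by the group's fitted DGMDP via the defection probabilities of Equation 5.

```python
import itertools
import numpy as np

tau, bonus_value, mu_SS, n_bonus = 10, 50.0, 100.0, 4

def admissible(positions):
    """One reading of the constraint: bonuses collected through position t must
    not exceed what the short queue would have paid over those t actions."""
    mu = np.zeros(tau + 1)
    for p in positions:
        mu[p] += bonus_value
    return all(mu[:t + 1].sum() <= mu_SS * t for t in range(1, tau))

def expected_reward_rate(positions):
    """Placeholder objective; substitute the model-predicted reward rate per
    action computed from the group's fitted DGMDP parameters."""
    return 0.0

placements = [p for p in itertools.combinations_with_replacement(range(1, tau), n_bonus)
              if admissible(p)]
best = max(placements, key=expected_reward_rate)
```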
6. Discussion

In this article, we developed a formal theoretical framework for modeling the dynamics of intertemporal choice. We hypothesized that the theory is suitable for modeling human behavior. We obtained support for the theory by demonstrating that it explains key qualitative behavioral phenomena (Section 3.2) and predicts quantitative outcomes from a series of behavioral experiments (Section 5). Although our first experiment merely suggests that the theory has the flexibility to fit behavioral data post hoc, each following experiment used parametric constraints from the earlier experiments, leading to strong predictions from the theory that match behavioral evidence. The theory allows us to design incentive mechanisms that steer individuals toward better outcomes (Section 4), and we showed that this idea works in practice for customizing bonuses to subpopulations playing our queue-waiting game.

The theory and the behavioral evidence both show a non-obvious and non-intuitive statistical interaction between the subpopulations and various incentive schemes. Because the theory has just four free parameters, it is readily pinned down to make strong, make-or-break predictions. Furthermore, it should be feasible to fit the theory to individuals as well as to subpopulations. With such fits comes the potential for maximally effective, truly individualized approaches to guiding intertemporal choice.

This research program is still far from demonstrating utility in incentivizing individuals to persevere toward long-term goals such as losing weight or saving for retirement. It remains unclear whether intertemporal choice on a long time scale will have the same dynamics as on the short time scale of our queue-waiting game. However, the finding that reward-seeking behavior on the time scale of eye movements can be related to reward-seeking behavior on the time scale of weeks and months (Shadmehr et al., 2010; Wolpert and Landy, 2012) leads us to hope for temporal invariance.

Acknowledgments

This research was supported by NSF grants DRL-1631428, SES-1461535, SBE-0542013, SMA-1041755, and seed summer funding from the Institute of Cognitive Science at the University of Colorado. We thank Ian Smith and Brett Israelson for design and coding of the experiments.

References

R. Argento, V. L. Bryant, and J. Sabelhaus. Early withdrawals from retirement accounts during the great recession. Contemporary Economic Policy, 33:1–16, 2015.

D. Fernandes, J. G. Lynch, Jr., and R. G. Netemeyer. Financial literacy, financial education, and downstream financial behaviors. Management Science, 60:1861–1883, 2014.

S. Frederick, G. Loewenstein, and T. O'Donoghue. Time discounting and time preference: A critical review. Journal of Economic Literature, 40(2):351–401, 2002. doi:10.1257/002205102320161311.

L. Green and J. Myerson. A discounting framework for choice with delayed and probabilistic rewards. Psychological Bulletin, 130:769–792, 2004.

D. Kahneman and A. Tversky. Prospect theory: An analysis of decision under risk. Econometrica, 47:263–292, 1979.

M. S. Kearney, P. Tufano, J. Guryan, and E. Hurst. Making savers winners: An overview of prize-linked savings products. Working Paper 16433, National Bureau of Economic Research, October 2010. URL http://www.nber.org/papers/w16433.

C. Kidd, H. Palmeri, and R. N. Aslin. Rational snacking: Young children's decision-making on the marshmallow task is moderated by beliefs about environmental reliability. Cognition, 126:109–114, 2012. doi:10.1016/j.cognition.2012.08.004.

K. N. Kirby. One-year temporal stability of delay-discount rates. Psychonomic Bulletin & Review, 16:457–462, 2009.

R. Kivetz. The effects of effort and intrinsic motivation on risky choice. Marketing Science, 22:477–502, 2003.

Z. Kurth-Nelson and A. D. Redish. A reinforcement learning model of precommitment in decision making. Frontiers in Behavioral Neuroscience, 4, 2010. doi:10.3389/fnbeh.2010.00184.

Z. Kurth-Nelson and A. D. Redish. Don't let me do that! Models of precommitment. Frontiers in Neuroscience, 6, 2012. doi:10.3389/fnins.2012.00138.

M. E. J. Masson and G. R. Loftus. Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology, 57(3):203–220, 2003.
J. T. McGuire and J. W. Kable. Rational temporal predictions can underlie apparent failure to delay gratification. Psychological Review, 120:395–410, 2013.

W. Mischel. Processes in delay of gratification. Advances in Experimental Social Psychology, 7:249–292, 1974. doi:10.1016/S0065-2601(08)60039-8.

W. Mischel and E. B. Ebbesen. Attention in delay of gratification. Journal of Personality and Social Psychology, 16:329–337, 1970.

W. Mischel, Y. Shoda, and P. K. Peake. The nature of adolescent competencies predicted by preschool delay of gratification. Journal of Personality & Social Psychology, 54:687–696, 1988.

W. Mischel, Y. Shoda, and M. I. Rodriguez. Delay of gratification in children. Science, 244(4907):933–938, 1989. doi:10.1126/science.2658056. URL http://science.sciencemag.org/content/244/4907/933.

N. Nisan and A. Ronen. Algorithmic mechanism design (extended abstract). In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, STOC '99, pages 129–140, New York, NY, USA, 1999. ACM. doi:10.1145/301250.301287. URL http://doi.acm.org/10.1145/301250.301287.

Y. Niv, J. A. Edlund, P. Dayan, and J. P. O'Doherty. Neural prediction errors reveal a risk-sensitive reinforcement-learning process in the human brain. The Journal of Neuroscience, 32:551–562, 2012.

N. Rhee and I. Boivie. The continuing retirement savings crisis. Technical report, National Institute on Retirement Security, March 2015. URL http://www.nirsonline.org/storage/nirs/documents/RSC%202015/final_rsc_2015.pdf.

R. Shadmehr, J. J. Orban de Xivry, M. Xu-Wilson, and T.-Y. Shih. Temporal discounting of reward and the cost of time in motor control. Journal of Neuroscience, 30(31):10507–10516, 2010.

Y. Shen, M. J. Tobia, T. Sommer, and K. Obermayer. Risk-sensitive reinforcement learning. Neural Computation, 26:1298–1328, 2014.

A. Tversky and D. Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5:279–323, 1992.

D. M. Wolpert and M. S. Landy. Motor control is decision making. Current Opinion in Neurobiology, 22:996–1003, 2012.

Supplementary Materials

Consider the value function for a special case where the willpower does not fluctuate, i.e., σ^2 = 0, and where intermediate rewards are not provided, i.e., µ_i = 0 for i ∈ {1...τ−1}. In this case, we can show that the solution to the DGMDP in Figure 1b is identical to the solution to the DGMDP in Figure 1d. We need to extend this result to the following more general cases, roughly in order of challenge:

• Allow for nonzero intermediate rewards
• Allow for the case of Figure 1c where µ_LLa/τ_a = µ_LLb/τ_b for all a and b
• Allow for the case where σ^2 > 0

1. Proof of the σ^2 = 0 and µ_i = 0 case

In Figure 1b, the value of state 1 is defined by the Bellman equation as:

    V(1) = max(µ_SS + γ V(1), γ^{τ-1}[µ_LL + γ V(1)]).    (1)

We can solve for V(1) if the first term is larger:

    V_SS(1) = µ_SS / (1 − γ).    (2)

We can solve for V(1) if the second term is larger:

    V_LL(1) = γ^{τ-1} µ_LL / (1 − γ^τ).    (3)

Now consider Figure 1d, whose Bellman equation can be simplified to:

    V(1) = max( Σ_{i=0}^{τ-1} γ^i µ_SS , γ^{τ-1} µ_LL )    (4)
         = max( (1 − γ^τ)/(1 − γ) µ_SS , γ^{τ-1} µ_LL )    (5)
         = (1 − γ^τ) max( µ_SS/(1 − γ) , γ^{τ-1} µ_LL/(1 − γ^τ) ).    (6)
Note that the two terms inside the max function of Equation 6 are identical to the values in Equations 2 and 3, and thus the value functions for Figures 1b and 1d are identical up to a scaling constant.
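A quick numerical check of this equivalence, with arbitrary parameter values:

```python
import numpy as np

gamma, tau, mu_SS, mu_LL = 0.9, 8, 1.0, 2.3

# Figure 1b (iterated task, sigma = 0, no intermediate rewards): fixed-point values.
v_ss_1b = mu_SS / (1 - gamma)                               # Equation 2
v_ll_1b = gamma ** (tau - 1) * mu_LL / (1 - gamma ** tau)   # Equation 3

# Figure 1d (one-shot approximation): the two terms inside the max of Equation 4.
v_ss_1d = sum(gamma ** i * mu_SS for i in range(tau))       # = (1 - gamma^tau)/(1 - gamma) mu_SS
v_ll_1d = gamma ** (tau - 1) * mu_LL

# The Figure 1d values equal the Figure 1b values scaled by (1 - gamma^tau),
# so the max, and hence the optimal choice, is the same in both formulations.
scale = 1 - gamma ** tau
assert np.isclose(v_ss_1d, scale * v_ss_1b)
assert np.isclose(v_ll_1d, scale * v_ll_1b)
```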