Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Offline Bayesian Aleatoric and Epistemic Uncertainty Quantification and Posterior Value Optimisation in Finite-State MDPs

Filippo Valdettaro Brain & Behaviour Lab
Dept. of Computing
Imperial College London, UK
A. Aldo Faisal Brain & Behaviour Lab
Dept. of Computing
Imperial College London, UK
Brain & Behaviour Lab
Dept. of Bioengineering
Imperial College London, UK
Chair in Digital Health & Data Science
University of Bayreuth, Germany
Abstract

We address the challenge of quantifying Bayesian uncertainty and incorporating it in offline use cases of finite-state Markov Decision Processes (MDPs) with unknown dynamics. Our approach provides a principled method to disentangle epistemic and aleatoric uncertainty, and a novel technique to find policies that optimise Bayesian posterior expected value without relying on strong assumptions about the MDP’s posterior distribution. First, we utilise standard Bayesian reinforcement learning methods to capture the posterior uncertainty in MDP parameters based on available data. We then analytically compute the first two moments of the return distribution across posterior samples and apply the law of total variance to disentangle aleatoric and epistemic uncertainties. To find policies that maximise posterior expected value, we leverage the closed-form expression for value as a function of policy. This allows us to propose a stochastic gradient-based approach for solving the problem. We illustrate the uncertainty quantification and Bayesian posterior value optimisation performance of our agent in simple, interpretable gridworlds and validate it through ground-truth evaluations on synthetic MDPs. Finally, we highlight the real-world impact and computational scalability of our method by applying it to the AI Clinician problem, which recommends treatment for patients in intensive care units and has emerged as a key use case of finite-state MDPs with offline data. We discuss the challenges that arise with Bayesian modelling of larger scale MDPs while demonstrating the potential to apply our methods rooted in Bayesian decision theory into the real world. We make our code available at https://github.com/filippovaldettaro/finite-state-mdps.

1 Introduction

In safety-critical machine learning applications, accurately quantifying confidence and uncertainty in decision outcomes becomes imperative for regulatory and trust reasons (Chua et al., 2022; Kendall and Gal, 2017). Uncertainties that such systems face can stem from limited data availability (epistemic) or originate from inherent environmental randomness (aleatoric). Uncertainty quantification is particularly relevant and challenging in reinforcement learning (RL) systems as uncertainty in decisions’ outcomes compounds in sequential decision-making.

We utilise inference schemes from classic Bayesian RL to account for epistemic uncertainty, with an exact-inference Bayesian dynamics model that assigns posterior probabilities to environments Duff (2002). Aleatoric uncertainty is addressed by exploiting the analytic solutions to the linear equations for higher return distribution moments (Sobel, 1982). Our contribution on the uncertainty quantification side is combining these two ingredients to determine overall aleatoric and epistemic standard deviations. We consider the computation and accuracy tradeoff of our method with prior work that does not exploit the tractability of finite-state MDPs.

On the control under uncertainty side, we propose a novel stochastic gradient-based method for policy optimisation that accounts for model dynamics uncertainty by optimising a policy for value under the environment posterior. In contrast to previous methods (Delage and Mannor, 2010), we do not rely on strong assumptions about the posterior distribution. We empirically demonstrate its performance and scalability, providing results on gridworlds and synthetic MDPs with varying offline dataset sizes, where we observe benefits in MDPs with higher uncertainty and lower data. Finally, our methods finds application in clinical decision support systems (CDSS), which leverage vast patient datasets to train RL algorithms for treatment suggestions (Gottesman et al., 2019; Li et al., 2020). We analyse a setup used for sepsis treatment (Komorowski et al., 2018), where patients’ condition and treatment options were clustered into finite states and actions, originally tackled by applying dynamic programming methods that assume known environments (Bellman, 1957). The methods presented enhance this approach with uncertainty quantification and uncertainty-aware control. We investigate the scalability of our methods in such practical environments and investigate the difference in expected posterior value between ours and the original policy optimisation method proposed. For this analysis, we included pessimism in the face of uncertainty, a common and necessary ingredient in offline RL (Kidambi et al., 2020; Yu et al., 2020; An et al., 2021) especially when the dataset does not adequately span the full state-action space (Agarwal et al., 2020), in the form of a conservative dynamics model.

2 Related Work

This section reviews uncertainty treatment in finite-state MDPs. We focus on epistemic uncertainty in Robust and Adaptive MDP settings, aleatoric uncertainty for risk-averse policies, and recent work quantifying both types of uncertainty in finite-state offline decision making contexts.

Robust and Adaptive MDPs. A simple model-based approach for an MDP uses relative visitation frequencies as the ground truth transition probabilities. This can introduce bias and result in policies that generalise poorly (Mannor et al., 2007; Wiesemann et al., 2013; Chow et al., 2015). To address this, a Bayesian approach is often employed to account for uncertainty in ambiguous transition dynamics, a common method in Bayesian RL (Ghavamzadeh et al., 2015). Bayesian dynamics models used in Bayes-Adaptive MDPs (BAMDPs) (Duff, 2002; Guez et al., 2012; Rigter et al., 2021) maintain the current belief in transition dynamics and enable optimal ‘offline’ planning of adaptable ‘online’ policy rollouts. However, these models may be intractable beyond simple MDPs (Poupart et al., 2006; Lee et al., 2018; Zintgraf et al., 2021).

In high-risk offline settings, exploration is undesirable. For instance, in the CDSS developed in (Komorowski et al., 2018), novel actions are avoided by only considering actions above a minimum visitation threshold. Therefore, we focus on optimal memoryless, stationary (non-adaptive) policies depending only on the state (Delage and Mannor, 2010). Finding policies that are robust to the worst-case realisation of uncertain dynamics can often lead to overly conservative policies, making average value optimization across a distribution of MDPs a better alternative (Nilim and Ghaoui, 2003; Iyengar, 2005; Xu and Mannor, 2006), and a principled one in line with Bayesian decision theory Robert et al. (2007). This will be the problem formulation we will be tackling here, while requiring that our methods scale to medium-sized (approximately 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT states) MDPs in data regimes where significant uncertainty is still present, in order for them to be applicable to real-world tasks. Delage and Mannor (2010) proposes a gradient-based method to optimise this objective, but makes the strong assumption that higher moments of the posterior distribution of transition parameters are small. Dimitrakakis (2011) proposes an algorithm that provides provably near-Bayes-optimal stationary policies, but focuses on providing an expected utility bound with respect to the Bayes-optimal adaptive policy, which in general has different utility to the optimal offline stationary policy.

Risk-averse policies. Accounting for inherent environmental stochasticity is often desirable. Using the distributional RL framework (Bellemare et al., 2017), policies are often informed by return distribution properties other than its mean to select risk-averse actions (Dabney et al., 2018; Clements et al., 2019). However, optimal policies for such statistical functionals are generally neither memoryless nor time-consistent (Sobel, 1982; Bellemare et al., 2023). Therefore, we focus on using the mean of the return distribution to guide the agent’s policy.

Aleatoric and Epistemic Uncertainty in Finite-state MDPs. Several recent efforts have tried to model both types of uncertainties in discrete environments. In a healthcare context, Joshi et al. (2021) used a Bayesian dynamics model and Monte Carlo trajectory sampling to estimate uncertainties and determine when to defer treatment. In contrast, Festor et al. (2021) trained an ensemble of distributional deep neural networks (DNNs) to learn the return distribution, effectively learning a ‘distribution over return distributions’. Neither of these works exploit the benefits of having a stationary MDP, and we therefore complement them by directly exploiting the tractability advantages presented by finite-state MDPs, such as closed-form return distribution moments and the possibility for exact inference on the environment dynamics, with methods leading to computationally efficient and accurate uncertainty representation.

3 Background

Dynamic Programming

A Markov Decision Process \mathcal{M}caligraphic_M (MDP) (Puterman, 2014) is given by a tuple (𝒮,𝒜,r,P,γ,ρ)𝒮𝒜𝑟𝑃𝛾𝜌(\mathcal{S},\mathcal{A},r,P,\gamma,\rho)( caligraphic_S , caligraphic_A , italic_r , italic_P , italic_γ , italic_ρ ), where 𝒮𝒮\mathcal{S}caligraphic_S and 𝒜𝒜\mathcal{A}caligraphic_A are the (assumed finite) state and action spaces respectively, r:𝒮:𝑟𝒮r:\mathcal{S}\to\mathbb{R}italic_r : caligraphic_S → blackboard_R is the reward function, P:𝒮×𝒜𝒫(𝒮):𝑃𝒮𝒜𝒫𝒮P:\mathcal{S}\times\mathcal{A}\to\mathcal{P}(\mathcal{S})italic_P : caligraphic_S × caligraphic_A → caligraphic_P ( caligraphic_S ) the transition kernel (𝒫𝒫\mathcal{P}caligraphic_P denoting a probability distribution over the corresponding set), γ[0,1)𝛾01\gamma\in[0,1)italic_γ ∈ [ 0 , 1 ) a discount factor and ρ𝜌\rhoitalic_ρ the distribution over initial states. Given a policy π:𝒮𝒫(𝒜):𝜋𝒮𝒫𝒜\pi:\mathcal{S}\to\mathcal{P}(\mathcal{A})italic_π : caligraphic_S → caligraphic_P ( caligraphic_A ), the return of an episode starting from state s𝑠sitalic_s is a random variable given by Gπ(s)=t=0γtRtsuperscript𝐺𝜋𝑠superscriptsubscript𝑡0superscript𝛾𝑡subscript𝑅𝑡G^{\pi}(s)=\sum_{t=0}^{\infty}\gamma^{t}R_{t}italic_G start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s ) = ∑ start_POSTSUBSCRIPT italic_t = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_γ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, where Rt=r(st)subscript𝑅𝑡𝑟subscript𝑠𝑡R_{t}=r(s_{t})italic_R start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_r ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), atπ(|st)a_{t}\sim\pi(\cdot|s_{t})italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ italic_π ( ⋅ | italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ), stP(|st1,at1)s_{t}\sim P(\cdot|s_{t-1},a_{t-1})italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ∼ italic_P ( ⋅ | italic_s start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ) given that s0=ssubscript𝑠0𝑠s_{0}=sitalic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_s. For simplicity, we assume reward as known and only dependent on state. This is a natural modelling step for MDPs where a certain state is associated with a particular reward and in practice is common when constructing MDPs. Nonetheless our methods extend naturally to more general formulations.

The expected value of G𝐺Gitalic_G is called the value function Vπ(s)=𝔼Gπ(s)superscript𝑉𝜋𝑠𝔼superscript𝐺𝜋𝑠V^{\pi}(s)=\mathbb{E}G^{\pi}(s)italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s ) = blackboard_E italic_G start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s ), and it can be shown that with this definition, V𝑉Vitalic_V satisfies the Bellman equation

Vπ(s)=r(s)+γa,sP(s|s,a)π(a|s)Vπ(s).superscript𝑉𝜋𝑠𝑟𝑠𝛾subscript𝑎superscript𝑠𝑃conditionalsuperscript𝑠𝑠𝑎𝜋conditional𝑎𝑠superscript𝑉𝜋superscript𝑠V^{\pi}(s)=r(s)+\gamma\sum_{a,s^{\prime}}P(s^{\prime}|s,a)\pi(a|s)V^{\pi}(s^{% \prime}).italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s ) = italic_r ( italic_s ) + italic_γ ∑ start_POSTSUBSCRIPT italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) italic_π ( italic_a | italic_s ) italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) . (1)

Dynamic programming methods, such as value iteration, can evaluate V𝑉Vitalic_V and provide the policy that optimises V𝑉Vitalic_V (Sutton and Barto, 2018). It can be shown that the value of any arbitrary policy π𝜋\piitalic_π is

𝐯(π)=(𝐈γ𝐓(π))1𝐫,𝐯𝜋superscript𝐈𝛾𝐓𝜋1𝐫\mathbf{v}(\pi)=(\mathbf{I}-\gamma\mathbf{T}(\pi))^{-1}\mathbf{r},bold_v ( italic_π ) = ( bold_I - italic_γ bold_T ( italic_π ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_r , (2)

with 𝐯𝐯\mathbf{v}bold_v and 𝐫𝐫\mathbf{r}bold_r being |𝒮|𝒮|\mathcal{S}|| caligraphic_S |-dimensional vectors with ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT element being Vπ(si)superscript𝑉𝜋subscript𝑠𝑖V^{\pi}(s_{i})italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) and r(si)𝑟subscript𝑠𝑖r(s_{i})italic_r ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) respectively (for s𝑠sitalic_s the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT state in 𝒮𝒮\mathcal{S}caligraphic_S) and 𝐓(π)𝐓𝜋\mathbf{T}(\pi)bold_T ( italic_π ) the policy-dependent transition matrix with element i,j𝑖𝑗i,jitalic_i , italic_j given by

𝐓i,j=aπ(a|si)P(sj|si,a).subscript𝐓𝑖𝑗subscript𝑎𝜋conditional𝑎subscript𝑠𝑖𝑃conditionalsubscript𝑠𝑗subscript𝑠𝑖𝑎\mathbf{T}_{i,j}=\sum_{a}\pi(a|s_{i})P(s_{j}|s_{i},a).bold_T start_POSTSUBSCRIPT italic_i , italic_j end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_π ( italic_a | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_P ( italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_a ) . (3)

The term (𝐈γ𝐓(π))1superscript𝐈𝛾𝐓𝜋1(\mathbf{I}-\gamma\mathbf{T}(\pi))^{-1}( bold_I - italic_γ bold_T ( italic_π ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT can be interpreted as successor features, in terms of which the analytic solution for value has a simple form (Dayan, 1993). For clarity we have highlighted here the dependence of 𝐓𝐓\mathbf{T}bold_T on π𝜋\piitalic_π and note that 𝐫𝐫\mathbf{r}bold_r does not depend on π𝜋\piitalic_π as we assumed state-dependent rewards.

Return Distribution

Unlike traditional distributional RL methods (Bellemare et al., 2017, 2023), we focus solely on the first two moments of the return distribution. This allows us to bypass the full distributional RL framework, as closed-form solutions for these moments can be obtained analytically for a given finite-state MDP.

Methods that solve the Bellman value equation (Eq. 1) can be extended to determine moments of the return distribution. For example, it can be shown that the variances of the return random variable Gπ(s)superscript𝐺𝜋𝑠G^{\pi}(s)italic_G start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s ) satisfy an analogous set of linear Bellman equations, with solution given in vector form by Sobel (1982):

𝐯𝐚𝐫(π)=(𝐈γ2𝐓(π))1𝐫(var)(π),𝐯𝐚𝐫𝜋superscript𝐈superscript𝛾2𝐓𝜋1superscript𝐫(var)𝜋\mathbf{var}(\pi)=(\mathbf{I}-\gamma^{2}\mathbf{T}(\pi))^{-1}\mathbf{r}^{\text% {(var)}}(\pi),bold_var ( italic_π ) = ( bold_I - italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_T ( italic_π ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_r start_POSTSUPERSCRIPT (var) end_POSTSUPERSCRIPT ( italic_π ) , (4)

where the vector of variances 𝐯𝐚𝐫𝐯𝐚𝐫\mathbf{var}bold_var has element i𝑖iitalic_i corresponding to the variance at state sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and 𝐫(var)superscript𝐫(var)\mathbf{r}^{\text{(var)}}bold_r start_POSTSUPERSCRIPT (var) end_POSTSUPERSCRIPT is the vector with element i𝑖iitalic_i being

𝐫i(var)(π)=jPπ(sj|si)(r(si)+γVπ(sj))2Vπ(si)2,subscriptsuperscript𝐫(var)𝑖𝜋subscript𝑗superscript𝑃𝜋conditionalsubscript𝑠𝑗subscript𝑠𝑖superscript𝑟subscript𝑠𝑖𝛾superscript𝑉𝜋subscript𝑠𝑗2superscript𝑉𝜋superscriptsubscript𝑠𝑖2\mathbf{r}^{\text{(var)}}_{i}(\pi)=\sum_{j}P^{\pi}(s_{j}|s_{i})(r(s_{i})+% \gamma V^{\pi}(s_{j}))^{2}-V^{\pi}(s_{i})^{2},bold_r start_POSTSUPERSCRIPT (var) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_π ) = ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_P start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_r ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_γ italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - italic_V start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (5)

where Pπ(s|s)=aπ(a|s)P(s|s,a)superscript𝑃𝜋conditionalsuperscript𝑠𝑠subscript𝑎𝜋conditional𝑎𝑠𝑃conditionalsuperscript𝑠𝑠𝑎P^{\pi}(s^{\prime}|s)=\sum_{a}\pi(a|s)P(s^{\prime}|s,a)italic_P start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s ) = ∑ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_π ( italic_a | italic_s ) italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ).

Bayesian Dynamics Model

The dynamics model we employ is standard in Bayesian RL, and is equivalent to the one used in BAMDPs (Ghavamzadeh et al., 2015; Poupart et al., 2006) with an unchanging belief and similar to the one proposed in Joshi et al. (2021), but stationary. By modelling the belief over the MDP’s dynamics parameters, this line of work effectively captures the uncertainty due to not being able to fully narrow down the true underlying MDP: with a finite number of transitions, there may be several potential MDPs that could have generated the observations, to which we can assign posterior probabilities by using Bayes’ rule. For our purposes, we take the reward function of the MDP as known (and deterministic), ultimately because in our applications we will define reward directly as a deterministic function of state, but treat the dynamics of the world as unknown.

Let θs,assuperscriptsubscript𝜃𝑠𝑎superscript𝑠\theta_{s,a}^{s^{\prime}}italic_θ start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT be a parameter representing the probability of transitioning to state ssuperscript𝑠s^{\prime}italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT given action a𝑎aitalic_a at state s𝑠sitalic_s, and consider a dataset of observed transitions (s,a,r,s)𝒟𝑠𝑎𝑟superscript𝑠𝒟(s,a,r,s^{\prime})\in\mathcal{D}( italic_s , italic_a , italic_r , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∈ caligraphic_D. The probability of transitioning to some next-state follows a multinomial distribution with parameters given by θ𝜃\thetaitalic_θ, and we can specify a conjugate Dirichlet prior on these so that for each state-action the resulting posterior probability is also Dirichlet. Assuming a symmetric Dirichlet prior (independent across different state-actions) with parameter αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT, the posterior distribution satisfies

p({θs,asi|si𝒮}|𝒟)j(θs,asj)nj+αp1,proportional-to𝑝conditionalconditional-setsuperscriptsubscript𝜃𝑠𝑎subscript𝑠𝑖subscript𝑠𝑖𝒮𝒟subscriptproduct𝑗superscriptsuperscriptsubscript𝜃𝑠𝑎subscript𝑠𝑗subscript𝑛𝑗subscript𝛼𝑝1p(\{\theta_{s,a}^{s_{i}}|s_{i}\in\mathcal{S}\}|\mathcal{D})\propto\prod_{j}(% \theta_{s,a}^{s_{j}})^{n_{j}+\alpha_{p}-1},italic_p ( { italic_θ start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_S } | caligraphic_D ) ∝ ∏ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT + italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT , (6)

with njsubscript𝑛𝑗n_{j}italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT being the number of times s,a𝑠𝑎s,aitalic_s , italic_a transitioned to state sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and the proportionality constant is given (in closed form) by the multivariate Beta function (Kotz et al., 2004).

When the number of possible outcomes, in this case next states, is large then inference on the Dirichlet parameters can be very data-inefficient: if a generic maximum-entropy prior parameter is employed it can assign a disproportionate amount of posterior probability to unobserved outcomes. To mitigate this, one may scale the prior parameter inversely to the number of outcomes, as done in a BAMDP context in Guez et al. (2012), or induce sparsity in the possible outcomes by modelling the belief of feasible next states through a hierarchical Bayesian model (Friedman and Singer, 1998). We will address this same issue in section 5.3 by employing a sparse Dirichlet model.

Aleatoric and Epistemic Uncertainty

In order to quantify and distinguish between epistemic uncertainty due to ambiguity in MDPs \mathcal{M}caligraphic_M given limited data and aleatoric uncertainty in the return G𝐺Gitalic_G, we use the common decomposition formula that arises after applying the law of total variance (Kendall and Gal, 2017; Joshi et al., 2021) to the return G𝐺Gitalic_G:

VarG(s)=Var𝔼G(s)epistemic+𝔼VarG(s)aleatoric,Var𝐺𝑠subscriptsubscriptVar𝔼subscript𝐺𝑠epistemicsubscriptsubscript𝔼Varsubscript𝐺𝑠aleatoric\mathop{\text{Var}}{G(s)}=\underbrace{{\text{Var}}_{\mathcal{M}}{\mathbb{E}{\,% G_{\mathcal{M}}(s)}}}_{\text{epistemic}}+\underbrace{{\mathbb{E}}_{\mathcal{M}% }{\mathop{\text{Var}}{G_{\mathcal{M}}(s)}}}_{\text{aleatoric}},Var italic_G ( italic_s ) = under⏟ start_ARG Var start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT blackboard_E italic_G start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_s ) end_ARG start_POSTSUBSCRIPT epistemic end_POSTSUBSCRIPT + under⏟ start_ARG blackboard_E start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT Var italic_G start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_s ) end_ARG start_POSTSUBSCRIPT aleatoric end_POSTSUBSCRIPT , (7)

where we have made clear that the dependence on the return random variable G𝐺Gitalic_G is conditioned on the MDPs \mathcal{M}caligraphic_M, so that the inner expectations and variances are marginalising over returns for a given MDP and the outer expectations and variances are marginalising over distributions of MDPs. The epistemic variance term captures the overall variance in the expected returns due to ambiguity in the MDPs and the aleatoric variance term is an estimate of the intrinsic variance averaged over the posterior MDP distribution. Eqs. 2 and 4 allow us to determine 𝔼G(s)=V(s)𝔼subscript𝐺𝑠subscript𝑉𝑠\mathbb{E}{\,G_{\mathcal{M}}(s)}=V_{\mathcal{M}}(s)blackboard_E italic_G start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_s ) = italic_V start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_s ) and VarG(s)Varsubscript𝐺𝑠\mathop{\text{Var}}{G_{\mathcal{M}}(s)}Var italic_G start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_s ) exactly, while averages and variances over the MDPs can be approximated through Monte Carlo sampling of the posterior over MDPs. In the limit of infinite data, the epistemic variance will tend to 00 as the probability mass of the posterior focuses in on a specific \mathcal{M}caligraphic_M, but the aleatoric term will not necessarily behave similarly.

Bayesian Objective

Beyond evaluating uncertainty, having a belief over the possible range of dynamics that an MDP can exhibit can allow us to account for this uncertain belief when carrying out control. Bayesian decision theory dictates that the optimal decision rule for a given prior belief and observed data is the one that maximises posterior expected value (Robert et al., 2007). Thus, we seek to find a policy that maximises the posterior expected value objective

maxπsρ(s)𝔼p(|𝒟)Vπ(s),\max_{\pi}\sum_{s}\rho(s)\mathbb{E}_{\mathcal{M}\sim p(\cdot|\mathcal{D})}V_{% \mathcal{M}}^{\pi}(s),roman_max start_POSTSUBSCRIPT italic_π end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT italic_ρ ( italic_s ) blackboard_E start_POSTSUBSCRIPT caligraphic_M ∼ italic_p ( ⋅ | caligraphic_D ) end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s ) , (8)

where the value of each state 𝔼p(|𝒟)Vπ(s)\mathbb{E}_{\mathcal{M}\sim p(\cdot|\mathcal{D})}V_{\mathcal{M}}^{\pi}(s)blackboard_E start_POSTSUBSCRIPT caligraphic_M ∼ italic_p ( ⋅ | caligraphic_D ) end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s ) has been marginalised with respect to the initial state distribution ρ𝜌\rhoitalic_ρ. This approach is consistent with previous literature that establishes the benefits of optimising this objective for decision-making in uncertain MDPs (Nilim and Ghaoui, 2003; Iyengar, 2005; Xu and Mannor, 2006). Thus, this objective will be one of the performance metrics we will use to evaluate different algorithms. Delage and Mannor (2010) addresses finding a policy that performs well on this objective, but their approach relies on a second-order expansion of the value in terms of the MDP parameters’ posterior distribution moments, and must thus assume small posterior uncertainty to be successful.

4 Methods

4.1 Uncertainty quantification

Some proposed approaches for jointly estimating aleatoric and epistemic uncertainty in discrete-space MDPs either overlook uncertainty in the transition model (Festor et al., 2021) or rely on extensive Monte Carlo sampling (Joshi et al., 2021). As a consequence, the former does not scale consistently with additional data (see Appendix D for empirical evidence for this claim) and we can improve on the latter in some regimes for the infinite-horizon MDP case by using closed-form expressions for the first two moments of the return distribution.

We present in Algorithm 1 a way to estimate posterior value, aleatoric and epistemic variances in Eq. 7, that exploits the finite-state stationary nature of the MDPs considered here. Its computational complexity scales as O(|𝒮|3)𝑂superscript𝒮3O(|\mathcal{S}|^{3})italic_O ( | caligraphic_S | start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) due to requiring an |𝒮|×|𝒮|𝒮𝒮|\mathcal{S}|\times|\mathcal{S}|| caligraphic_S | × | caligraphic_S | matrix inversion for each of the NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT dynamics samples. In contrast, methods that rely on Monte Carlo return samples to estimate aleatoric and epistemic return will require a larger number of Dirichlet samples and large simulation trajectory lengths to achieve comparable accuracy, but no matrix inversion.

We investigate this trade-off quantitatively in Appendix A and conclude that the larger number of samples required for a full Monte Carlo-style evaluation (similar to Joshi et al. (2021)) is not worth the additional sampling overhead for the MDPs we are considering (γ=0.999,|𝒮|<1000formulae-sequence𝛾0.999𝒮1000\gamma=0.999,\mathcal{|S|}<1000italic_γ = 0.999 , | caligraphic_S | < 1000). In particular, we show that for large γ𝛾\gammaitalic_γ, finding exact solutions for values using analytic forms will be more computationally efficient as longer rollouts become necessary to have accurate return samples and more posterior samples become necessary to decrease the error from Monte Carlo sampling. In principle one could also use some iterative policy evaluation scheme (Sutton and Barto, 2018) to solve for the first and second moments of the return distribution, sacrificing a small amount of accuracy but avoiding a matrix inverse calculation.

Algorithm 1 Bayesian Value, Epistemic and Aleatoric Uncertainty Evaluation
Policy π𝜋\piitalic_π, state sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, posterior distribution over transition parameters p(|𝒟)𝑝conditional𝒟p(\mathcal{M}|\mathcal{D})italic_p ( caligraphic_M | caligraphic_D )
θsas{1:NM}NMsubscriptsubscriptsuperscript𝜃superscript𝑠𝑠𝑎conditional-set1subscript𝑁𝑀subscript𝑁𝑀{\theta^{s^{\prime}}_{sa}}_{\{1:N_{M}\}}\leftarrow N_{M}italic_θ start_POSTSUPERSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT start_POSTSUBSCRIPT { 1 : italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT } end_POSTSUBSCRIPT ← italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT matrix samples from p(|𝒟)𝑝conditional𝒟p(\mathcal{M}|\mathcal{D})italic_p ( caligraphic_M | caligraphic_D )
for sS,sSformulae-sequence𝑠𝑆superscript𝑠𝑆s\in S,s^{\prime}\in Sitalic_s ∈ italic_S , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_S do
     {𝐓ss}{1:NM}aπ(a|s)θsa{1:NM}ssubscriptsubscript𝐓𝑠superscript𝑠conditional-set1subscript𝑁𝑀subscript𝑎𝜋conditional𝑎𝑠superscriptsubscript𝜃𝑠𝑎conditional-set1subscript𝑁𝑀superscript𝑠\{\mathbf{T}_{ss^{\prime}}\}_{\{1:N_{M}\}}\leftarrow\sum_{a}\pi(a|s)\theta_{sa% \,\{1:N_{M}\}}^{s^{\prime}}{ bold_T start_POSTSUBSCRIPT italic_s italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT } start_POSTSUBSCRIPT { 1 : italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT } end_POSTSUBSCRIPT ← ∑ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_π ( italic_a | italic_s ) italic_θ start_POSTSUBSCRIPT italic_s italic_a { 1 : italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT } end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT \triangleright NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT action-marginalised transition matrices
end for
for t=1𝑡1t=1italic_t = 1 to NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT do
     𝐯t(𝐈γ𝐓t)1𝐫subscript𝐯𝑡superscript𝐈𝛾subscript𝐓𝑡1𝐫\mathbf{v}_{t}\leftarrow(\mathbf{I}-\gamma\mathbf{T}_{t})^{-1}\mathbf{r}bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ( bold_I - italic_γ bold_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_r \triangleright Eq. 2 for samples
     for sk𝒮subscript𝑠𝑘𝒮s_{k}\in\mathcal{S}italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ caligraphic_S do
         Vksubscript𝑉𝑘absentV_{k}\leftarrowitalic_V start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← element k𝑘kitalic_k of 𝐯tsubscript𝐯𝑡\mathbf{v}_{t}bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     end for
     for sk𝒮subscript𝑠𝑘𝒮s_{k}\in\mathcal{S}italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ caligraphic_S do
         𝐫k(var)j{𝐓sksj}t(r(sk)+γVj)2Vk2subscriptsuperscript𝐫(var)𝑘subscript𝑗subscriptsubscript𝐓subscript𝑠𝑘subscript𝑠𝑗𝑡superscript𝑟subscript𝑠𝑘𝛾subscript𝑉𝑗2superscriptsubscript𝑉𝑘2\mathbf{r}^{\text{(var)}}_{k}\leftarrow\sum_{j}\{\mathbf{T}_{s_{k}s_{j}}\}_{t}% (r(s_{k})+\gamma V_{j})^{2}-V_{k}^{2}bold_r start_POSTSUPERSCRIPT (var) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT { bold_T start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( italic_r ( italic_s start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) + italic_γ italic_V start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - italic_V start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT
     end for
     𝐯𝐚𝐫t(𝐈γ2𝐓t)1𝐫(var)subscript𝐯𝐚𝐫𝑡superscript𝐈superscript𝛾2subscript𝐓𝑡1superscript𝐫(var)\mathbf{var}_{t}\leftarrow(\mathbf{I}-\gamma^{2}\mathbf{T}_{t})^{-1}\mathbf{r}% ^{\text{(var)}}bold_var start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← ( bold_I - italic_γ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT bold_T start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_r start_POSTSUPERSCRIPT (var) end_POSTSUPERSCRIPT \triangleright Equation 4
     vtsubscript𝑣𝑡absentv_{t}\leftarrowitalic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← element i𝑖iitalic_i of 𝐯tsubscript𝐯𝑡\mathbf{v}_{t}bold_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
     vart𝑣𝑎subscript𝑟𝑡absentvar_{t}\leftarrowitalic_v italic_a italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ← element i𝑖iitalic_i of 𝐯𝐚𝐫tsubscript𝐯𝐚𝐫𝑡\mathbf{var}_{t}bold_var start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
end for
bayes_value 1NMt=1NMvtabsent1subscript𝑁𝑀superscriptsubscript𝑡1subscript𝑁𝑀subscript𝑣𝑡\leftarrow\frac{1}{N_{M}}\sum_{t=1}^{N_{M}}v_{t}← divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
aleatoric_var 1NMt=1NMvartabsent1subscript𝑁𝑀superscriptsubscript𝑡1subscript𝑁𝑀𝑣𝑎subscript𝑟𝑡\leftarrow\frac{1}{N_{M}}\sum_{t=1}^{N_{M}}var_{t}← divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_v italic_a italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
epistemic_var 1NM1t=1NM(vtbayes_value)2absent1subscript𝑁𝑀1superscriptsubscript𝑡1subscript𝑁𝑀superscriptsubscript𝑣𝑡bayes_value2\leftarrow\frac{1}{N_{M}-1}\sum_{t=1}^{N_{M}}(v_{t}-\text{bayes\_value})^{2}← divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT - 1 end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - bayes_value ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT return bayes_value, aleatoric_var, epistemic_var

4.2 Policy improvement

Unlike with a single MDP, the objective in Eq. 8 does not always admit a deterministic optimal policy (we provide an example in Appendix B where the optimal policy is stochastic). For this reason, approaches analogous to classical dynamic programming cannot find an optimal policy. We suggest a gradient-based approach to optimise this objective in Algorithm 2. We approach the optimisation by taking stochastic gradient steps of the value objective with respect to a parametrised stochastic policy, which is made possible by the analytic form for value for given parameter samples. This approach is qualitatively distinct to the classic policy gradient in RL which estimates policy gradients from rolled-out trajectories (Sutton and Barto, 2018) . In contrast to other methods (Komorowski et al., 2018; Dimitrakakis, 2011) this does not introduce bias due to optimising only with respect to a finite number of transition samples: by re-sampling from the posterior every gradient step, we remove the bias that would occur by picking a smaller finite sample, and we note that all standard stochastic gradient optimisation guarantees regarding computational complexity or convergence to a local optimum will apply. For example, one can show that with appropriate learning rate scheduling, this convergence is guaranteed almost surely (Bottou, 1998) (although we empirically found that convergence was also achieved with a constant learning rate). Note that since γ<1𝛾1\gamma<1italic_γ < 1, all quantities (values, variances) are bounded, continuous and differentiable functions of policy parameters. We also remark that Algorithm 2 can be implemented faster computationally by reducing the batch size or by resampling the posterior periodically rather than at every gradient step (see Appendix C for computational benchmarks).

Algorithm 2 Stochastic Gradient Policy Optimisation
Initial deterministic π𝜋\piitalic_π, posterior distribution over transition parameters p(|𝒟)𝑝conditional𝒟p(\mathcal{M}|\mathcal{D})italic_p ( caligraphic_M | caligraphic_D ), initial policy softness parameter η𝜂\etaitalic_η, learning rate α𝛼\alphaitalic_α
s𝒮,a𝒜,zsalog(η/(|𝒜|1))formulae-sequencefor-all𝑠𝒮formulae-sequence𝑎𝒜subscript𝑧𝑠𝑎𝜂𝒜1\forall s\in\mathcal{S},a\in\mathcal{A},\,z_{sa}\leftarrow\log(\eta/(|\mathcal% {A}|-1))∀ italic_s ∈ caligraphic_S , italic_a ∈ caligraphic_A , italic_z start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT ← roman_log ( italic_η / ( | caligraphic_A | - 1 ) )
s𝒮,zsπ(s)log(1η)formulae-sequencefor-all𝑠𝒮subscript𝑧𝑠𝜋𝑠1𝜂\forall s\in\mathcal{S},\,z_{s\pi(s)}\leftarrow\log(1-\eta)∀ italic_s ∈ caligraphic_S , italic_z start_POSTSUBSCRIPT italic_s italic_π ( italic_s ) end_POSTSUBSCRIPT ← roman_log ( 1 - italic_η ) \triangleright Set initial π𝜋\piitalic_π params
while not converged do
     s𝒮,a𝒜,letπ(a|s)exp(zsa)aexp(zsa)formulae-sequencefor-all𝑠𝒮formulae-sequence𝑎𝒜let𝜋conditional𝑎𝑠subscript𝑧𝑠𝑎superscriptsubscript𝑎subscript𝑧𝑠superscript𝑎\forall s\in\mathcal{S},a\in\mathcal{A},\,\text{let}\,\pi(a|s)\leftarrow\frac{% \exp(z_{sa})}{\sum_{a}^{\prime}\exp({z_{sa^{\prime}}})}∀ italic_s ∈ caligraphic_S , italic_a ∈ caligraphic_A , let italic_π ( italic_a | italic_s ) ← divide start_ARG roman_exp ( italic_z start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT roman_exp ( italic_z start_POSTSUBSCRIPT italic_s italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) end_ARG
     θsas{1:n}nsubscriptsubscriptsuperscript𝜃superscript𝑠𝑠𝑎conditional-set1𝑛𝑛{\theta^{s^{\prime}}_{sa}}_{\{1:n\}}\leftarrow nitalic_θ start_POSTSUPERSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT start_POSTSUBSCRIPT { 1 : italic_n } end_POSTSUBSCRIPT ← italic_n minibatch samples from p(|𝒟)𝑝conditional𝒟p(\mathcal{M}|\mathcal{D})italic_p ( caligraphic_M | caligraphic_D )
     for sS,sSformulae-sequence𝑠𝑆superscript𝑠𝑆s\in S,s^{\prime}\in Sitalic_s ∈ italic_S , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_S do
         {𝐓ss}{1:NM}aπ(a|s)θsa{1:NM}ssubscriptsubscript𝐓𝑠superscript𝑠conditional-set1subscript𝑁𝑀subscript𝑎𝜋conditional𝑎𝑠superscriptsubscript𝜃𝑠𝑎conditional-set1subscript𝑁𝑀superscript𝑠\{\mathbf{T}_{ss^{\prime}}\}_{\{1:N_{M}\}}\leftarrow\sum_{a}\pi(a|s)\theta_{sa% \,\{1:N_{M}\}}^{s^{\prime}}{ bold_T start_POSTSUBSCRIPT italic_s italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT } start_POSTSUBSCRIPT { 1 : italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT } end_POSTSUBSCRIPT ← ∑ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT italic_π ( italic_a | italic_s ) italic_θ start_POSTSUBSCRIPT italic_s italic_a { 1 : italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT } end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT \triangleright NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT action-marginalised transition matrices
     end for
     𝐯1:NM(𝐈γ𝐓1:NM)1𝐫subscript𝐯:1subscript𝑁𝑀superscript𝐈𝛾subscript𝐓:1subscript𝑁𝑀1𝐫\mathbf{v}_{1:N_{M}}\leftarrow(\mathbf{I}-\gamma\mathbf{T}_{1:N_{M}})^{-1}% \mathbf{r}bold_v start_POSTSUBSCRIPT 1 : italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_POSTSUBSCRIPT ← ( bold_I - italic_γ bold_T start_POSTSUBSCRIPT 1 : italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_r \triangleright Eq. 2 for samples
     =iρ𝐯𝐢subscript𝑖𝜌subscript𝐯𝐢\mathcal{L}=-\sum_{i}\mathbf{\rho}\cdot\mathbf{v_{i}}caligraphic_L = - ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ρ ⋅ bold_v start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT \triangleright Posterior and state marginalised
     s𝒮,a𝒜,zsazsaαzsaformulae-sequencefor-all𝑠𝒮formulae-sequence𝑎𝒜subscript𝑧𝑠𝑎subscript𝑧𝑠𝑎𝛼subscript𝑧𝑠𝑎\forall s\in\mathcal{S},a\in\mathcal{A},\,z_{sa}\leftarrow z_{sa}-\alpha\frac{% \partial\mathcal{L}}{\partial z_{sa}}∀ italic_s ∈ caligraphic_S , italic_a ∈ caligraphic_A , italic_z start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT ← italic_z start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT - italic_α divide start_ARG ∂ caligraphic_L end_ARG start_ARG ∂ italic_z start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT end_ARG \triangleright Gradient step
end while
s𝒮,a𝒜,π(a|s)exp(zsa)aexp(zsa)formulae-sequencefor-all𝑠𝒮formulae-sequence𝑎𝒜𝜋conditional𝑎𝑠subscript𝑧𝑠𝑎superscriptsubscript𝑎subscript𝑧𝑠superscript𝑎\forall s\in\mathcal{S},a\in\mathcal{A},\,\pi(a|s)\leftarrow\frac{\exp(z_{sa})% }{\sum_{a}^{\prime}\exp({z_{sa^{\prime}}})}∀ italic_s ∈ caligraphic_S , italic_a ∈ caligraphic_A , italic_π ( italic_a | italic_s ) ← divide start_ARG roman_exp ( italic_z start_POSTSUBSCRIPT italic_s italic_a end_POSTSUBSCRIPT ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT roman_exp ( italic_z start_POSTSUBSCRIPT italic_s italic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) end_ARG return π𝜋\piitalic_π

5 Experiments

Here we apply the proposed method on toy environments and a real-world clinical dataset. The toy environments demonstrate the salient features of our methods where ground-truth MDPs can be easily generated and interpreted, while the application to clinical data confirms its scalability to MDPs with practical use. We first examine uncertainty evaluation on interpretable gridworlds for a specific policy and then consider policy optimisation on gridworlds and synthetic MDPs. Finally we apply the same methods to the MIMIC-III dataset (Johnson et al., 2016), and present results on the impact that carrying out our approach has on this dataset’s posterior expected value.

5.1 Gridworld

We consider a gridworld with stochastic transitions: at each step there is a probability prandsubscript𝑝randp_{\text{rand}}italic_p start_POSTSUBSCRIPT rand end_POSTSUBSCRIPT of being pushed down regardless of action taken. Otherwise, the agent moves up, down, left or right by one square determined by the action. The observed transitions dataset 𝒟𝒟\mathcal{D}caligraphic_D is generated by repeatedly spawning an agent in a non-terminal random state and carrying out a random action. Experiments are ran on the gridworld visualised in Fig. 1(a). The results presented here are for datasets of varying sizes, where smaller ones are always subsets of any larger ones to ensure that the latter are strictly more informative.

Uncertainty Quantification

We consider the policy uncertainty evaluation problem, comparing how results from our Bayesian approach differ from others when evaluating aleatoric and epistemic uncertainty for the policy that is optimal under the MLE dynamics parameter estimates. We see in Fig. 1(b) that the uncertainty quantification results from applying Algorithm 1 scale consistently with varying dataset size (epistemic uncertainty always becomes small with more data) and intrinsic stochasticity (higher prandsubscript𝑝randp_{\text{rand}}italic_p start_POSTSUBSCRIPT rand end_POSTSUBSCRIPT corresponds to higher aleatoric uncertainty). In contrast, we find that the approach in Festor et al. (2021) always leads to low epistemic uncertainty at the end of training, as the lack of knowledge of the underlying MDP is not modelled, and thus does not scale consistently with data. In Appendix D we visualise how epistemic uncertainty evolves during training with different datasets. We adapt their algorithm to carry out SARSA policy evaluation on the same, fixed policy and observe that it always tends to be small regardless of how informative the dataset is by the end of training. Additionally, as discussed in section 4.1, the computation of aleatoric and epistemic uncertainty through closed-form moments as in Algorithm 1 does not require averaging samples over episodic rollouts as in Joshi et al. (2021).

Refer to caption

(a) Gridworld
Refer to caption
(b) Aleatoric and epistemic uncertainty
Figure 1: Fig. 1(a) shows the gridworld used in the experiments. The terminal states are the failure F states (cliff) and the goal G state. The agent can move up, down, left, or right (or remain stationary if it hits the boundary of the grid). The transition dynamics have intrinsic stochasticity controlled by the probability prandsubscript𝑝randp_{\text{rand}}italic_p start_POSTSUBSCRIPT rand end_POSTSUBSCRIPT, which is the probability of pushing the agent down regardless of action taken. Offline training datasets were generated by randomly sampling actions at random non-terminal states. State \bigstar is chosen as an exemplar state to plot state-dependent uncertainties. In Fig. 1(b), the plot shows the epistemic (blue) and aleatoric (red) standard deviations as a function of training dataset size, with different levels of intrinsic stochasticity indicated by solid, dashed, and dotted lines.

Bayesian Policy Optimisation

An optimal memoryless policy that accounts for the model uncertainty maximises the posterior expected value given in Eq. 8. We compare the performance on this objective of four policies: the optimal policy when transition probabilities are modelled as naive visitation frequencies (MLE-optimal policy), the optimal policy for the expected or marginalised (referred to as nominal) MDP (Nominal policy) (Dimitrakakis, 2011; Delage and Mannor, 2010), the policy derived from the second-order approximation of value in terms of posterior moments proposed in Delage and Mannor (2010) (Second order policy) and ours, described in Algorithm 2 (Gradient policy). In Algorithm 2 and the second order policy, we choose the initial policies to be a softened version of the Nominal policy (η=0.1)𝜂0.1(\eta=0.1)( italic_η = 0.1 ). Just like the MLE-optimal policy, the required amount of computation is one round of value iteration (Sutton and Barto, 2018) and is therefore a computationally negligible addition to the algorithm.

A further method that optimises a similar objective is the Multi-Sample Backwards Induction (MSBI policy) algorithm suggested in Dimitrakakis (2011). However, this algorithm was originally devised to find a policy that achieves a near-optimal lower bound on the utility with respect to the Bayes-adaptive policy, which is a different task to the one considered here. Nonetheless, for completeness we report the corresponding results for this method in Appendix E, and found that it did not outperform ours on any of the metrics considered.

Refer to caption
(a) Single run
Refer to caption
(b) Gradient vs MLE-optimal
Refer to caption
(c) Gradient vs Nominal
Refer to caption
(d) Gradient vs Second order
Figure 2: Fig. 2(a) shows the average posterior expected return (‘Value’) as a function of dataset size for a single set of generated datasets, as in the objective in Eq. 8. The example gridworld has prand=0.25subscript𝑝rand0.25p_{\text{rand}}=0.25italic_p start_POSTSUBSCRIPT rand end_POSTSUBSCRIPT = 0.25. As value will be dataset-dependent, we show the average and standard deviation between the pairwise difference in posterior values between ours and the other methods in Figs. 2(b)2(c) and 2(d), where values above the red dashed line signify an improvement. These plots report the average and standard deviation across 50 generated datasets for each dataset size.

We empirically find that the gradient-optimised policy consistently outperforms the other methods on optimising the posterior value objectives, especially in lower data regimes when optimising the Bayesian posterior value objective. Results from a sample run are presented in Fig. 2(a) for different dataset sizes. The corresponding relative performances of our method against both MLE-optimal and second order policies on the posterior value objective over 50 sets of generated datasets are presented in Fig. 2(b), 2(c) and 2(d) with error bars (standard deviations), confirming that our method consistently outperforms the other two in terms of posterior value maximisation over a larger number of randomly-generated datasets.

5.2 Synthetic MDPs

While gridworlds are convenient to interpret results relating to uncertainty disentanglement, they are not adequate for repeated experimentation and evaluation on multiple ground truth environments. Therefore, we present here results on unstructured, synthetic MDPs that allow us to meaningfully evaluate the ground-truth performance of learned policies on a large number of MDPs.

The MDPs we consider have 5 states, 5 actions and are generated by sampling the ground-truth transition probabilities independently for each state-action from a flat Dirichlet prior (with any state being a valid next state). To break the symmetry between states, we sample the state-dependent reward from a normal Gaussian and keep these constant throughout all experiments. The datasets are generated by sampling the outcome of visiting each state-action between 1 and 10 times, resulting in different dataset sizes. For each dataset size, we generate 250 different MDPs and datasets and train the various policies on these datasets. Finally, we roll out the policies and evaluate them on the ground-truth MDP that generated the data (for 1k steps and with η=0.5𝜂0.5\eta=0.5italic_η = 0.5). Similarly to the previous section, the intrinsic ‘luck’ associated with MDPs generated can affect the maximum value that each policy is able to achieve, so we focus on the difference in performance between methods for each MDP. In Fig. 3, we display the ground-truth relative performance of the various policies compared to ours. For all ground-truth results, we display standard error of the mean rather than standard deviation as we are interested in average performance across prior MDP samples rather than the variability over each individual sample. We also report the performance on the posterior expected value, as with the gridworlds, in Fig. 4. In Appendix F, we present the same results in Figs. 3 and 4 with the y𝑦yitalic_y-values rescaled to reflect fractional improvements rather than absolute values.

Refer to caption
(a) Gradient vs MLE-optimal
Refer to caption
(b) Gradient vs Nominal
Refer to caption
(c) Gradient vs Second order
Figure 3: Ground truth pairwise difference in average performance (and shaded standard error of the mean) on the policies found by each method and rolled out on the ground-truth synthetic MDP. Regions above the red line correspond to improved performance with our method.
Refer to caption
(a) Gradient vs MLE-optimal
Refer to caption
(b) Gradient vs Nominal
Refer to caption
(c) Gradient vs Second order
Figure 4: Average and standard deviation (shaded) of posterior expected value difference achieved by our method. Regions above the red line correspond to improved objective optimisation with our method.

We notice that our stochastic gradient/based method consistently outperforms the others in the low data regimes, both for ground truth performance as well as optimisation performance. MLE and Nominal policies can be very sub-optimal in the low data regime and then significantly improve with more data, suggesting that with more data the gradient-based optimisation may not be necessary, whereas the second order policy is consistently slightly sub-optimal compared to ours.

5.3 Clinical Data

We apply Algorithm 2 to the MIMIC-III dataset, as in Komorowski et al. (2018) and Festor et al. (2021), using the same clustering of 752 states and 25 actions. Two terminal states represent patient recovery and death, with reward 1 for a patient’s recovery and 0 for death. Thus value corresponds to probability of survival when γ1𝛾1\gamma\approx 1italic_γ ≈ 1 (γ=0.999𝛾0.999\gamma=0.999italic_γ = 0.999). As in Komorowski et al. (2018), actions at any state with fewer than 5 visits in the dataset are excluded. We address here two main points. First we confirm that our method can scale computationally to real-world MDPs and datasets. Secondly, we investigate the impact on the posterior expected value when employing our policy compared to the MLE-optimal one as in the original work.

Fig. 5 shows the posterior expected value of the two policies under two different choices of dynamics prior. Fig. 5a corresponds to a symmetric Dirichlet prior chosen via Bayesian model selection. The posterior probability mass over transition parameters still has a high entropy causing the agent to believe transitions are essentially random. In Fig. 5b, we employ a conservative sparse dynamics model that only includes the death state and any observed next states in the dataset as possible next-state outcomes for each state-action. Here, we notice that the posterior expected value can be significantly increased by using our policy optimisation algorithm, suggesting that we are in a data regime where the choice of algorithm for policy selection is important. We defer more detailed discussion and a visualisation of resulting uncertainties to Appendix G.

Refer to caption
Figure 5: Posterior of each state (blue dots) under our policy and the MLE-optimal policy in the clinical MDP. Points above the diagonal indicate superior performance of our policy on the posterior expect value. The left plot (a) demonstrates the impact of policy choice on performance when employing Bayesian model selection with an optimal parameter of αp=0.072subscript𝛼𝑝0.072\alpha_{p}=0.072italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = 0.072. The right plot (b) shows the same result when using a prior selected through a conservative sparse dynamics model.

6 Limitations

Our methods are investigated for a specific category of Markov Decision Processes (MDPs) with finite states and known reward structures. While our method is well-suited for the lower data regimes, we empirically observe that it can be slightly suboptimal compared to the classical dynamic programming baselines, in particular the “Nominal” policy, in higher data regimes where uncertainty in the underlying transitions is low. Nonetheless, in practice this can be detected by comparing the Bayesian objective of the nominal policy to the posterior expected value achieved by the policy after our stochastic gradient optimisation and choose the one with the better posterior expected value. We have shown our approach can handle moderately-sized MDPs that carry practical real-world application possibilities in section 5.3). However, it relies on matrix inversion (𝒪(|𝒮|3)𝒪superscript𝒮3\mathcal{O}\left(|\mathcal{S}|^{3}\right)caligraphic_O ( | caligraphic_S | start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) complexity) so it cannot directly scale to much larger MDPs. In Appendix C we show empirical results on how our method scales to larger state-spaces on the computational side. One key limitation of our proposed methods towards real-world application is the sensitivity of the resulting policy and inferred values on the dynamics model prior used, especially when data is inadequate for effective inference across all dynamics priors. For example, we observe that the effects of having a sparse or evidence-optimised model can be significant on both the inferred policy and the associated posterior values (see Fig. 5) and exactly how to best include or combine these elements to select a prior that achieves consistently good performance on real-world MDPs is an important question and one that we defer to future work.

7 Conclusion

We have proposed methods to estimate Bayesian aleatoric and epistemic uncertainty in the outcome of finite-state space policies and to maximise posterior expected value. We offer a real-world example application of our method in a prominent case of discrete-state offline RL (Komorowski et al., 2018) in clinical decision support systems. In contrast to previous approaches that estimate such uncertainties in MDPs with finite states, we directly exploit the tractability of stationary MDPs to avoid potentially computationally expensive or inaccurate episodic rollouts (necessary in non-stationary MDPs (Joshi et al., 2021)) or employing ensemble of model-free approaches that may overlook dynamics uncertainty (Clements et al., 2019; Festor et al., 2021).

On the control side, we introduced a stochastic gradient-based method to optimise posterior expected value (Xu and Mannor, 2006) that, unlike previous approaches (Delage and Mannor, 2010), does not make strong assumptions on the posterior’s distribution and does not introduce bias from having a finite number of posterior samples (Dimitrakakis, 2011). Through numerical simulations, we have shown that our method consistently improves on the posterior value objective as well as performance on ground-truth MDPs, particularly in low data regimes, when these are unknown and sampled from a given prior. Our method can be extended to optimise value over any distribution of MDPs that can be sampled from, including those with uncertain or more expressive rewards (of the form R(s,a,s)𝑅𝑠𝑎superscript𝑠R(s,a,s^{\prime})italic_R ( italic_s , italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT )) as these also have differentiable closed-form expressions for value in terms of policy (Sobel, 1982). We apply our method to a clinical dataset, confirming its computational scalability, and notice that the resulting policy significantly improves the posterior expected values compared to that in the original approach (Komorowski et al., 2018). We suggested domain-specific conservatism in the dynamics model as a potential solution to new challenges that arise in this task and a starting point for further work towards finding offline policies with robust ground-truth performance in finite-state MDPs.

Acknowledgements.
FV was supported by a Department of Computing PhD scholarship and AAF was supported by a UKRI Turing AI Fellowship (EP/V025449/1).

References

  • Agarwal et al. (2020) Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning, pages 104–114. PMLR, 2020.
  • An et al. (2021) Gaon An, Seungyong Moon, Jang-Hyun Kim, and Hyun Oh Song. Uncertainty-based offline reinforcement learning with diversified q-ensemble. Advances in Neural Information Processing Systems, 34:7436–7447, 2021.
  • Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458. PMLR, 2017.
  • Bellemare et al. (2023) Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
  • Bellman (1957) Richard Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1 edition, 1957.
  • Bottou (1998) Léon Bottou. Online learning and stochastic approximations. Online Learning in Neural Networks, 17(9):142, 1998.
  • Chow et al. (2015) Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a cvar optimization approach. Advances in Neural Information Processing Systems, 28, 2015.
  • Chua et al. (2022) Michelle Chua, Doyun Kim, Jongmun Choi, Nahyoung G. Lee, Vikram Deshpande, Joseph Schwab, Michael H. Lev, Ramon G. Gonzalez, Michael S. Gee, and Synho Do. Tackling prediction uncertainty in machine learning for healthcare. Nature Biomedical Engineering, Dec 2022. ISSN 2157-846X. 10.1038/s41551-022-00988-x.
  • Clements et al. (2019) William R Clements, Bastien Van Delft, Benoît-Marie Robaglia, Reda Bahi Slaoui, and Sébastien Toth. Estimating risk and uncertainty in deep reinforcement learning. arXiv preprint arXiv:1905.09638, 2019.
  • Dabney et al. (2018) Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Dayan (1993) Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
  • Delage and Mannor (2010) Erick Delage and Shie Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
  • Dimitrakakis (2011) Christos Dimitrakakis. Robust Bayesian reinforcement learning through tight lower bounds. In European Workshop on Reinforcement Learning, pages 177–188. Springer, 2011.
  • Duff (2002) Michael O’Gordon Duff. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. University of Massachusetts Amherst, 2002.
  • Festor et al. (2021) Paul Festor, Giulia Luise, Matthieu Komorowski, and A Aldo Faisal. Enabling risk-aware reinforcement learning for medical interventions through uncertainty decomposition. In Workshop in Interpretable Machine Learning for Healthcare, 2021.
  • Friedman and Singer (1998) Nir Friedman and Yoram Singer. Efficient Bayesian parameter estimation in large discrete domains. Advances in Neural Information Processing Systems, 11, 1998.
  • Ghavamzadeh et al. (2015) Mohammad Ghavamzadeh, Shie Mannor, Joelle Pineau, Aviv Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(5-6):359–483, 2015.
  • Gottesman et al. (2019) Omer Gottesman, Fredrik Johansson, Matthieu Komorowski, Aldo Faisal, David Sontag, Finale Doshi-Velez, and Leo Anthony Celi. Guidelines for reinforcement learning in healthcare. Nature Medicineine, 25(1):16–18, 2019.
  • Guez et al. (2012) Arthur Guez, David Silver, and Peter Dayan. Efficient Bayes-adaptive reinforcement learning using sample-based search. Advances in Neural Information Processing Systems, 25, 2012.
  • Guo et al. (2022) Kaiyang Guo, Yunfeng Shao, and Yanhui Geng. Model-based offline reinforcement learning with pessimism-modulated dynamics belief. arXiv preprint arXiv:2210.06692, 2022.
  • Hoeffding (1994) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. The collected works of Wassily Hoeffding, pages 409–426, 1994.
  • Iyengar (2005) Garud N Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
  • Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
  • Joshi et al. (2021) Shalmali Joshi, Sonali Parbhoo, and Finale Doshi-Velez. Pre-emptive learning-to-defer for sequential medical decision-making under uncertainty. arXiv preprint arXiv:2109.06312, 2021.
  • Kendall and Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
  • Kidambi et al. (2020) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33:21810–21823, 2020.
  • Komorowski et al. (2018) Matthieu Komorowski, Leo A Celi, Omar Badawi, Anthony C Gordon, and A Aldo Faisal. The artificial intelligence clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicineine, 24(11):1716–1720, 2018.
  • Kotz et al. (2004) Samuel Kotz, Narayanaswamy Balakrishnan, and Norman L Johnson. Continuous multivariate distributions, Volume 1: Models and applications, volume 1. John Wiley & Sons, 2004.
  • Kumar et al. (2019) Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in Neural Information Processing Systems, 32, 2019.
  • Lee et al. (2018) Gilwoo Lee, Brian Hou, Aditya Mandalika, Jeongseok Lee, Sanjiban Choudhury, and Siddhartha S Srinivasa. Bayesian policy optimization for model uncertainty. arXiv preprint arXiv:1810.01014, 2018.
  • Li et al. (2020) Luchen Li, Ignacio Albert-Smet, and Aldo A Faisal. Optimizing medical treatment for sepsis in intensive care: from reinforcement learning to pre-trial evaluation. arXiv preprint arXiv:2003.06474, 2020.
  • Mannor et al. (2007) Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
  • Nilim and Ghaoui (2003) Arnab Nilim and Laurent Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. Advances in Neural Information Processing Systems, 16, 2003.
  • Poupart et al. (2006) Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 697–704, 2006.
  • Puterman (2014) Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Statistics. Wiley, 2014. ISBN 9781118625873.
  • Rigter et al. (2021) Marc Rigter, Bruno Lacerda, and Nick Hawes. Risk-averse Bayes-adaptive reinforcement learning. Advances in Neural Information Processing Systems, 34:1142–1154, 2021.
  • Robert et al. (2007) Christian P Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation, volume 2. Springer, 2007.
  • Sobel (1982) Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Wiesemann et al. (2013) Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.
  • Xu and Mannor (2006) Huan Xu and Shie Mannor. The robustness-performance tradeoff in Markov decision processes. Advances in Neural Information Processing Systems, 19, 2006.
  • Yu et al. (2020) Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. Mopo: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  • Zintgraf et al. (2021) Luisa Zintgraf, Sebastian Schulze, Cong Lu, Leo Feng, Maximilian Igl, Kyriacos Shiarlis, Yarin Gal, Katja Hofmann, and Shimon Whiteson. Varibad: variational Bayes-adaptive deep rl via meta-learning. The Journal of Machine Learning Research, 22(1):13198–13236, 2021.

Appendix A Probabilistic evaluation bounds

Here we provide a quantitative investigation into the choice of method to evaluate the quantities of interest for a given policy, including a comparison of the probabilistic bounds on the errors due to finite numbers of samples. We compare the efficiency required to achieve an evaluation within a certain accuracy ε𝜀\varepsilonitalic_ε with a minimum probability 1δ1𝛿1-\delta1 - italic_δ for methods that (i) carry out Monte Carlo sampling for every evaluation step and (ii) (ours, Algorithm 1) carry out an exact calculation of the return distribution moments and then Monte Carlo evaluation with samples from the dynamics posterior. The quantity we investigate in detail is the Bayesian value at a given state for a given policy (appearing in Eq. 8 for a given policy and state), and since aleatoric and epistemic uncertainties are calculated in very similar fashion, the conclusions regarding Bayesian value estimation will also carry through to the uncertainty quantification case.

A.1 Exact moments

The quantity of interest we wish to approximate is

V^=𝔼(V(s)),^𝑉subscript𝔼subscript𝑉𝑠\hat{V}=\mathbb{E_{\mathcal{M}}}(V_{\mathcal{M}}(s)),over^ start_ARG italic_V end_ARG = blackboard_E start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_V start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_s ) ) , (9)

where the expectation is taken over the Dirichlet posterior of MDP dynamics parameters. For a given set of dynamics parameters \mathcal{M}caligraphic_M, we have access to the closed form expression for the first moment of the return distribution V(s)subscript𝑉𝑠V_{\mathcal{M}}(s)italic_V start_POSTSUBSCRIPT caligraphic_M end_POSTSUBSCRIPT ( italic_s ) (in terms of policy, dynamics and reward) as presented in Eq. 2.

We assume a bounded reward |r|rmax𝑟subscript𝑟max|r|\leq r_{\text{max}}| italic_r | ≤ italic_r start_POSTSUBSCRIPT max end_POSTSUBSCRIPT and employ the well-known form of the Hoeffding inequality Hoeffding [1994] valid for the random variable Sn=i=1nXisubscript𝑆𝑛superscriptsubscript𝑖1𝑛subscript𝑋𝑖S_{n}=\sum_{i=1}^{n}{X_{i}}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT bounded and i.i.d. such that 𝔼(Sn)=μ𝔼subscript𝑆𝑛𝜇\mathbb{E}(S_{n})=\mublackboard_E ( italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = italic_μ:

(|Snμ|ϵ)12exp(2ϵ2nΔ2)subscript𝑆𝑛𝜇italic-ϵ122superscriptitalic-ϵ2𝑛superscriptΔ2\mathbb{P}(|S_{n}-\mu|\leq\epsilon)\geq 1-2\exp{\left(-\frac{2\epsilon^{2}}{n% \Delta^{2}}\right)}blackboard_P ( | italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - italic_μ | ≤ italic_ϵ ) ≥ 1 - 2 roman_exp ( - divide start_ARG 2 italic_ϵ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_n roman_Δ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) (10)

with ΔΔ\Deltaroman_Δ being the size of the interval on which X𝑋Xitalic_X can take values.

In context, we take Xi=1NMVisubscript𝑋𝑖1subscript𝑁𝑀subscript𝑉𝑖X_{i}=\frac{1}{N_{M}}V_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as the closed-form expression for the value of the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT of the NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT dynamics samples, so μ=V^𝜇^𝑉\mu=\hat{V}italic_μ = over^ start_ARG italic_V end_ARG. From the boundedness assumption on the reward, we can also bound |Vi|rmax1γ=Vmaxsubscript𝑉𝑖subscript𝑟max1𝛾subscript𝑉max|V_{i}|\leq\frac{r_{\text{max}}}{1-\gamma}=V_{\text{max}}| italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | ≤ divide start_ARG italic_r start_POSTSUBSCRIPT max end_POSTSUBSCRIPT end_ARG start_ARG 1 - italic_γ end_ARG = italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT and Δ2Vmax/NMΔ2subscript𝑉maxsubscript𝑁𝑀\Delta\leq 2V_{\text{max}}/N_{M}roman_Δ ≤ 2 italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT / italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT. We require enough samples so that with probability at least 1δ1𝛿1-\delta1 - italic_δ the error in our approximation of V^^𝑉\hat{V}over^ start_ARG italic_V end_ARG is within ϵitalic-ϵ\epsilonitalic_ϵ of the true value. By the Hoeffding inequality, we can ensure this is the case by choosing NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT such that

δ2exp(NMϵ22Vmax2),𝛿2subscript𝑁𝑀superscriptitalic-ϵ22superscriptsubscript𝑉max2\delta\leq 2\exp{\left(-\frac{N_{M}\epsilon^{2}}{2V_{\text{max}}^{2}}\right)},italic_δ ≤ 2 roman_exp ( - divide start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT italic_ϵ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) , (11)

which corresponds to the smallest integer NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT such that

NMlog(2δ)(2Vmax2ϵ2).subscript𝑁𝑀2𝛿2superscriptsubscript𝑉max2superscriptitalic-ϵ2N_{M}\geq\log\left(\frac{2}{\delta}\right)\left(\frac{2V_{\text{max}}^{2}}{% \epsilon^{2}}\right).italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ≥ roman_log ( divide start_ARG 2 end_ARG start_ARG italic_δ end_ARG ) ( divide start_ARG 2 italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_ϵ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) . (12)

A.2 Monte-Carlo sampling

The alternative method to using closed-form expressions for the moments of the return distribution given an MDP sample would be to in turn approximate these through Monte Carlo samples, as done in Joshi et al. [2021]. To do so, given the infinite horizon nature of the MDPs we are considering, we would have to accumulate rewards over a roll-out with a finite number of steps T𝑇Titalic_T, thus incurring in some error, which can be bounded above by γTVmaxsuperscript𝛾𝑇subscript𝑉max\gamma^{T}V_{\text{max}}italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT. Note that the tightness of this bound will depend entirely on the reward structure of the MDP, and that this is not a source of error that can be reduced by repeatedly sampling transitions. For the purposes of the analysis presented, we will be generous in mostly ignoring the computational cost associated with sampling trajectories for a given MDP. In practice, sampling from a categorical distribution (i.e. sampling the trajectories for a given MDP) is significantly faster than sampling from a Dirichlet distribution (i.e. sampling the transition matrix), so we incorporate the overall computational cost of trajectory sampling into the modest condition that T𝑇Titalic_T cannot be arbitrarily large, but assume infinite trajectory sampling capability otherwise. This assumption allows us to determine the value for the ithsuperscript𝑖thi^{\text{th}}italic_i start_POSTSUPERSCRIPT th end_POSTSUPERSCRIPT given MDP arbitrarily accurately up to this error, so that the distance between the true value Visubscript𝑉𝑖V_{i}italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to the accumulated finite sum of rewards Visuperscriptsubscript𝑉𝑖V_{i}^{\prime}italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT will be bounded by |ViVi|γTVmaxsubscript𝑉𝑖subscriptsuperscript𝑉𝑖superscript𝛾𝑇subscript𝑉max|V_{i}-V^{\prime}_{i}|\leq\gamma^{T}V_{\text{max}}| italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | ≤ italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT.

Thus, we can consider the distance

|V^1NMiVi|^𝑉1subscript𝑁𝑀subscript𝑖subscriptsuperscript𝑉𝑖\displaystyle\left|\hat{V}-\frac{1}{N_{M}}\sum_{i}V^{\prime}_{i}\right|| over^ start_ARG italic_V end_ARG - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | |V^1NMiVi|+|1NMiVi1NMiVi|absent^𝑉1subscript𝑁𝑀subscript𝑖subscript𝑉𝑖1subscript𝑁𝑀subscript𝑖subscript𝑉𝑖1subscript𝑁𝑀subscript𝑖subscriptsuperscript𝑉𝑖\displaystyle\leq\left|\hat{V}-\frac{1}{N_{M}}\sum_{i}V_{i}\right|+\left|\frac% {1}{N_{M}}\sum_{i}V_{i}-\frac{1}{N_{M}}\sum_{i}V^{\prime}_{i}\right|≤ | over^ start_ARG italic_V end_ARG - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | + | divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | (13)
|V^1NMiVi|+γTVmax,absent^𝑉1subscript𝑁𝑀subscript𝑖subscript𝑉𝑖superscript𝛾𝑇subscript𝑉max\displaystyle\leq\left|\hat{V}-\frac{1}{N_{M}}\sum_{i}V_{i}\right|+\gamma^{T}V% _{\text{max}},≤ | over^ start_ARG italic_V end_ARG - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | + italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT , (14)

so that if

|V^1NMiVi|+γTVmaxϵ,^𝑉1subscript𝑁𝑀subscript𝑖subscript𝑉𝑖superscript𝛾𝑇subscript𝑉maxitalic-ϵ\left|\hat{V}-\frac{1}{N_{M}}\sum_{i}V_{i}\right|+\gamma^{T}V_{\text{max}}\leq\epsilon,| over^ start_ARG italic_V end_ARG - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | + italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ≤ italic_ϵ , (15)

with probability at least 1δ1𝛿1-\delta1 - italic_δ, then the distance to the original estimate also satisfies

|V^1NMiVi|ϵ.^𝑉1subscript𝑁𝑀subscript𝑖subscriptsuperscript𝑉𝑖italic-ϵ\left|\hat{V}-\frac{1}{N_{M}}\sum_{i}V^{\prime}_{i}\right|\leq\epsilon.| over^ start_ARG italic_V end_ARG - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | ≤ italic_ϵ . (16)

with at least probability 1δ1𝛿1-\delta1 - italic_δ.

As such, we apply the Hoeffding inequality in the form

(|V^1NMiVi|ϵγTVmax)12exp(NM2(ϵγTVmax)22Vmax2).^𝑉1subscript𝑁𝑀subscript𝑖subscript𝑉𝑖italic-ϵsuperscript𝛾𝑇subscript𝑉max12superscriptsubscript𝑁𝑀2superscriptitalic-ϵsuperscript𝛾𝑇subscript𝑉max22superscriptsubscript𝑉max2\mathbb{P}\left(\left|\hat{V}-\frac{1}{N_{M}}\sum_{i}V_{i}\right|\leq\epsilon-% \gamma^{T}V_{\text{max}}\right)\geq 1-2\exp{\left(-\frac{N_{M}^{2}(\epsilon-% \gamma^{T}V_{\text{max}})^{2}}{2V_{\text{max}}^{2}}\right)}.blackboard_P ( | over^ start_ARG italic_V end_ARG - divide start_ARG 1 end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | ≤ italic_ϵ - italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ) ≥ 1 - 2 roman_exp ( - divide start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_ϵ - italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) . (17)

Note that this also imposes a minimum horizon truncation of T>log(ϵ/Vmax)/logγ𝑇italic-ϵsubscript𝑉max𝛾T>\log(\epsilon/V_{\text{max}})/\log\gammaitalic_T > roman_log ( italic_ϵ / italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ) / roman_log italic_γ. Explicitly including the probability threshold δ𝛿\deltaitalic_δ now corresponds to finding an NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT such that

δ2exp(NM2(ϵγTVmax)22Vmax2),𝛿2superscriptsubscript𝑁𝑀2superscriptitalic-ϵsuperscript𝛾𝑇subscript𝑉max22superscriptsubscript𝑉max2\delta\leq 2\exp{\left(-\frac{N_{M}^{2}(\epsilon-\gamma^{T}V_{\text{max}})^{2}% }{2V_{\text{max}}^{2}}\right)},italic_δ ≤ 2 roman_exp ( - divide start_ARG italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( italic_ϵ - italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) , (18)

so

NMlog(2δ)2Vmax2(ϵγTVmax)2.subscript𝑁𝑀2𝛿2superscriptsubscript𝑉max2superscriptitalic-ϵsuperscript𝛾𝑇subscript𝑉max2N_{M}\geq\log\left(\frac{2}{\delta}\right)\frac{2V_{\text{max}}^{2}}{(\epsilon% -\gamma^{T}V_{\text{max}})^{2}}.italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ≥ roman_log ( divide start_ARG 2 end_ARG start_ARG italic_δ end_ARG ) divide start_ARG 2 italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ( italic_ϵ - italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG . (19)

This bound corresponds to a worsening by a factor of (1γTVmax/ε)2superscript1superscript𝛾𝑇subscript𝑉max𝜀2(1-\gamma^{T}V_{\text{max}}/\varepsilon)^{-2}( 1 - italic_γ start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_V start_POSTSUBSCRIPT max end_POSTSUBSCRIPT / italic_ε ) start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT in the number of samples required to get comparable accuracy to the method that uses exact moments. For example, for the gridworld setup considered (γ=0.999,rmax=1formulae-sequence𝛾0.999subscript𝑟max1\gamma=0.999,r_{\text{max}}=1italic_γ = 0.999 , italic_r start_POSTSUBSCRIPT max end_POSTSUBSCRIPT = 1 and positing ϵ=0.001italic-ϵ0.001\epsilon=0.001italic_ϵ = 0.001) would require an order of magnitude of T105𝑇superscript105T\approx 10^{5}italic_T ≈ 10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT for every rolled out trajectory, (of which we are assuming to be able to carry out an arbitrarily large number to obtain this bound) at which point the contribution of the trajectory sampling to the bottleneck would be severe and require a completely different bound to take it into account. Thus, for the regime we consider, choosing to compute exact moments does save computation towards the computational bottleneck of taking samples from a Dirichlet posterior.

Note that aleatoric and epistemic uncertainty will behave similarly: aleatoric variance is an analogous expectation over the second instead of first moment (which we again can have in closed-form or can estimate through Monte Carlo samples) and the bound will be analogous. Similarly, for epistemic variance the error in return due to truncated trajectories will compound when calculating the variance over expected returns, and again we expect a similarly greater number of samples for NMsubscript𝑁𝑀N_{M}italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT.

Appendix B Stochastic Optimal Policy

Here we provide an illustrative example of how the Bayesian objective Eq. 8 for expected value when MDPs are sampled from some distribution may not have a deterministic optimal policy.

Consider the following ‘casino’ MDP with three states. State s𝑠sitalic_s corresponds to the player being in the casino, where the possible actions are to play or leave. Being in state s𝑠sitalic_s costs the player 1 unit of currency every time that state s𝑠sitalic_s is visited. The outcome of leaving is to deterministically transition to a terminal state with no further rewards. On the other hand, playing has a stochastic outcome, with a probability θ𝜃\thetaitalic_θ of losing, in which case the player remains in state s𝑠sitalic_s, and a probability 1θ1𝜃1-\theta1 - italic_θ of winning, in which case the player transitions to state w𝑤witalic_w, where they receive a payout of R𝑅Ritalic_R units and then deterministically transition to the terminal state with no further rewards. Thus, each realisation of θ𝜃\thetaitalic_θ corresponds to a slightly different MDP with different probabilities of winning and therefore different optimal policies.

For a policy that plays with probability π𝜋\piitalic_π and leaves with probability 1π1𝜋1-\pi1 - italic_π, the Bellman equation for value of state s𝑠sitalic_s, V𝑉Vitalic_V, under this policy is

V=1+γ(πθV+π(1θ)R).𝑉1𝛾𝜋𝜃𝑉𝜋1𝜃𝑅V=-1+\gamma(\pi\theta V+\pi(1-\theta)R).italic_V = - 1 + italic_γ ( italic_π italic_θ italic_V + italic_π ( 1 - italic_θ ) italic_R ) . (20)

Solving for V𝑉Vitalic_V gives the value starting from state s𝑠sitalic_s for a specific MDP:

V=1+γπ(1θ)R1γπθ.𝑉1𝛾𝜋1𝜃𝑅1𝛾𝜋𝜃V=\frac{-1+\gamma\pi(1-\theta)R}{1-\gamma\pi\theta}.italic_V = divide start_ARG - 1 + italic_γ italic_π ( 1 - italic_θ ) italic_R end_ARG start_ARG 1 - italic_γ italic_π italic_θ end_ARG . (21)

We now consider the expected value when θ𝜃\thetaitalic_θ is sampled from some distribution. For example, if θ𝜃\thetaitalic_θ is sampled from a Bernoulli distribution with parameter p=12𝑝12p=\frac{1}{2}italic_p = divide start_ARG 1 end_ARG start_ARG 2 end_ARG, the expected value of π𝜋\piitalic_π over this distribution of MDPs is

V=12(1+γπR11γπ).𝑉121𝛾𝜋𝑅11𝛾𝜋V=\frac{1}{2}\left(-1+\gamma\pi R-\frac{1}{1-\gamma\pi}\right).italic_V = divide start_ARG 1 end_ARG start_ARG 2 end_ARG ( - 1 + italic_γ italic_π italic_R - divide start_ARG 1 end_ARG start_ARG 1 - italic_γ italic_π end_ARG ) . (22)

We visualise this value as a function of π𝜋\piitalic_π in Fig. 6 for R=10𝑅10R=10italic_R = 10, γ=0.99𝛾0.99\gamma=0.99italic_γ = 0.99. The maximum value is not achieved at π=0𝜋0\pi=0italic_π = 0 or π=1𝜋1\pi=1italic_π = 1, but rather at π0.69𝜋0.69\pi\approx 0.69italic_π ≈ 0.69. A policy that never plays achieves a value of 1111, a policy that always plays a value of 45.5545.55-45.55- 45.55 and the optimal (stochastic) policy a value of about 1.341.341.341.34. Thus, we have an example where there is no deterministic optimal policy.

Refer to caption
Figure 6: Plot of average value across a distribution of casino MDPs as a function of policy.

Appendix C Computational scalability

We present in Fig.7 empirical results with regards to scaling our method to larger state-spaces. The experiment we benchmark is the one on synthetic MPDs, as carried out in section 5.2 but with varying state-space sizes for two different posterior sample batch sizes. We also consider both the case where the posterior is resampled every gradient step (as in the synthetic MDP experiments) as well as the case where the posterior is only sampled once at the start of training. While the latter is not suggested in practice, the resulting copmutation time bounds the compute time that can be saved by resampling periodically between gradient steps rather than at every step during training.

Refer to caption
Figure 7: Computation time required to run Algorithm 2 for 1000 gradient steps. We display results for batches of sizes 8 and 256, for the case where the posterior is sampled before each gradient step (resampling) or only once at the start of training (no resampling).

Appendix D Policy uncertainty evaluation

The policy we present and compare results for is the policy that optimises the maximum likelihood estimate (MLE) of the transition dynamics MDP, where transition probability is taken to be the relative frequency of observed transitions, which we refer to as the MLE-optimal policy.

Running SARSA policy evaluation on the methods proposed in Festor et al. [2021] explicitly shows that the epistemic uncertainty in the dynamics transition is not captured by the ensemble method used. Fig. 8 shows that with this setup, epistemic uncertainty correlates with loss but is independent of amount of data observed. This is visible as the curves collapse to small epistemic uncertainty values irrespective of data set size even though the amount of data in the smallest data set size (25) is smaller than the total number of transitions of the MDP (80). This is because it captures information on parametric training uncertainty but not of the dynamics model uncertainty.

Refer to caption
Figure 8: Plot of the epistemic uncertainty and loss as a function of training timestep demonstrating that epistemic is not accurately tracked by previous methods. Epistemic standard deviation (top row, red data) is quantified here over 10k time steps, corresponding to the agent carrying out transitions over many episodes. The corresponding ensemble quantile regression loss (bottom row, blue data) at each training timestep is shown below. Here we show as examplar the results for fixed policy using ensemble methods with a MLE-dynamics model for different number of observed transitions in the dataset generated by the gridworld with prand=0.5subscript𝑝rand0.5p_{\text{rand}}=0.5italic_p start_POSTSUBSCRIPT rand end_POSTSUBSCRIPT = 0.5. The value that the epistemic standard deviation converges to is always small for all visited states and independent of dataset size as the only notion of uncertainty captured in this setup is one of parametric uncertainty and not MDP uncertainty.

Appendix E Multi Sample Backward Induction

We apply here a variant of the method MSBI presented in Dimitrakakis [2011], which outputs a policy that is near-optimal with respect to the Bayes-adaptive optimal policy under some assumptions. The main assumption is that the belief change between timesteps is bounded, and the authors show that a backwards induction greedy algorithm can yield a policy that has Bayesian utility which lower bounds the adaptive Bayes-optimal utility within an error term proportional to this bound. This optimal adaptive utility will, however, in general be different to the posterior value expectation that we are aiming to maximise with our policy, so the algorithm’s purpose is not entirely aligned with ours. Nonetheless, in practice the proposed algorithm involves carrying out an analogous version of value iteration where at each timestep the iterative value is given by also marginalising over MDPs. This is also possible to implement and test in our setting so we provide here the relevant results for completeness, although we emphasise that the theoretical foundations and guarantees of near-optimality don’t apply to our specific case. For a fairer comparison to the gradient-based methods, we also use the nominal policy as the starting policy in this algorithm.

Gridworld

The work suggests a number of posterior samples of the order of (ϵ(1γ))31014superscriptitalic-ϵ1𝛾3superscript1014\left(\epsilon(1-\gamma)\right)^{-3}\approx 10^{14}( italic_ϵ ( 1 - italic_γ ) ) start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT ≈ 10 start_POSTSUPERSCRIPT 14 end_POSTSUPERSCRIPT using γ=0.999𝛾0.999\gamma=0.999italic_γ = 0.999 and an error tolerance on the value of ϵ=0.01italic-ϵ0.01\epsilon=0.01italic_ϵ = 0.01, which is a computationally intractable number of samples to store and process for transition matrices. Thus, we use a number of samples (NM=32768subscript𝑁𝑀32768N_{M}=32768italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT = 32768) and maximum number of iterations (2000)2000(2000)( 2000 ) that roughly match the computation time of the gradient-optimised policy (30-60s depending on dataset without GPU acceleration for the gridworld experiments). In Fig. 9(a) we show the performance of MSBI on the relative posterior expected value objective for the same gridworld as in section 5.1 over 50 runs, where regions above the red line correspond to improved posterior value maximisation with our algorithm. As may be expected from the gap between the algorithm’s original intention and our application, MSBI consistently underperforms with the exception of high data regimes, where the probability mass collapses on one MDP and the algorithm essentially reduces to value iteration.

Synthetic MDPs

We also apply MSBI to synthetic MDPs as presented in section 5.2 and report results in Figs.9(b) and 9(c). Once again, due to the large number of experiments ran (250 runs each for each of the 10 different dataset sizes) we had to reduce the number of posterior samples to be NM=2048subscript𝑁𝑀2048N_{M}=2048italic_N start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT = 2048 and fix the maximum number of iterations to 500500500500.

Refer to caption

(a) Gradient vs MSBI gridworld posterior expected value
Refer to caption
(b) Gradient vs MSBI synthetic MDP ground truth performance (shaded standard error of the mean)

Refer to caption

(c) Gradient vs MSBI synthetic MDP posterior expected value
Figure 9: Relative performance of MSBI on gridworld posterior expected value objective over 50 runs (Fig. 9(a)) and synthetic MDP ground truth performance (Fig. 9(b), with shaded standard error of the mean) and posterior value objective (Fig. 9(c)) over 250 runs.

Appendix F Synthetic MDPs relative performance

We display in Figs. 10 and 11 results corresponding to those presented in Figs. 3 and  4 but with the y𝑦yitalic_y-axis scaled by the value achieved by our method (resulting in a different scaling value for each dataset size), so that the new resulting plot can be interpreted as a fractional relative improvement.

Refer to caption
(a) Gradient vs MLE-optimal
Refer to caption
(b) Gradient vs Nominal
Refer to caption
(c) Gradient vs Second order
Figure 10: Ground truth pairwise difference in average performance (and shaded standard error of the mean) normalised by average performance of the Gradient (ours) method for each dataset. Regions above the red line correspond to improved performance with our method.
Refer to caption
(a) Gradient vs MLE-optimal
Refer to caption
(b) Gradient vs Nominal
Refer to caption
(c) Gradient vs Second order
Figure 11: Average and standard deviation (shaded) of posterior expected value normalised by average performance of the Gradient (ours) method for each dataset. Regions above the red line correspond to improved performance with our method.

Appendix G Clinical data discussion

G.1 General discussion

Bayesian inference with Dirichlet distributions with a large number of possible outcomes (next states) is problematic, as mentioned in section 3 [Friedman and Singer, 1998], and careful thought must be given to what prior to employ. First we consider a Bayesian model selection approach: we assume all possible states are reachable and symmetric. This allows us to optimise the model evidence with respect to the unique parameter αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT of the prior, in the hope that specifying a prior which is more in line with the observations will lead to better inference (see Appendix G.2 for details). As expected, the optimal αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is found to be much smaller than 1, αp=0.072subscript𝛼𝑝0.072\alpha_{p}=0.072italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = 0.072, giving less weight after inference to the prior than the maximum-entropy αp=1subscript𝛼𝑝1\alpha_{p}=1italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = 1 prior does. However, this approach still fails to accurately model our belief, which can be seen by considering the following scenario: suppose the patient is in a bad state and has two options, namely (a) try a treatment that has been attempted many times with rare success or (b) try a treatment that has always gone wrong, but has been tried a small number of times so has high uncertainty in the outcome. Option (b) is clearly not appealing, but the agent’s posterior will still place significant probability mass on unobserved states in the presence of a small number of transitions, thus highly encouraging the agent to take the less visited action and assigning it a disproportionately high value. Upon inspection, this is exactly what is happening in the outlier state in Fig. 5a (at approximate coordinates (0.6,0.8)0.60.8(0.6,0.8)( 0.6 , 0.8 )), and the value given by this Bayesian posterior is likely unreasonable.

To address this, we introduce conservatism by considering only observed states and the death state as next possible states, thus ensuring a more conservative prior. Inducing conservatism in offline RL with datasets that do not adequately cover the full state-action space is in line with literature [Agarwal et al., 2020, Kumar et al., 2019], and conservative MDP models have found success in continuous offline RL by modulating reward [Yu et al., 2020, Kidambi et al., 2020] or dynamics [Guo et al., 2022], somewhat analogously to what is being proposed here. By only including observed or negative outcomes, the agent is unable to place probability mass on unsupported next-states and therefore use high uncertainty to inflate the value of poorly visited actions in bad states. The scarcity of outcomes allows for meaningful inference using a maximum-entropy prior with αp=1subscript𝛼𝑝1\alpha_{p}=1italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = 1, and a high-entropy prior is favorable from a conservatism standpoint. It encourages the agent to select actions that have sufficient support to offset the high prior probability mass assigned to the death state. The Bayesian values inferred with this setup are presented in Fig. 5b. Fig. 5 shows the possible improvement, according to the Bayesian posterior value, of employing the Bayesian gradient-optimised policy compared to the MLE-optimal policy used in Komorowski et al. [2018], resulting in higher probability of survival (according to the dynamics model). In particular, we note that employing the gradient-optimised policy improves the value, and therefore corresponding approximate probability of survival, by about 2.1%percent2.12.1\%2.1 % when averaged across states, with a maximum improvement on a particular state of 17.8%percent17.817.8\%17.8 %, according to the conservative Bayesian dynamics model.

In Fig. 12 we show how the MIMIC-III states aleatoric and epistemic uncertainties are related. The values are computed using the same conservative dynamics model of Fig. 5b.

Refer to caption
Figure 12: States plotted according to their epistemic and aleatoric standard deviations. Each dot represents a state, with its colour corresponding to its average value according to the Bayesian posterior.

As expected for the particular reward structure of the MDP considered, aleatoric uncertainty and average Bayesian value are strongly related: since the return variable is approximately binomial (approximately 1111 for success and 00 for failure) its mean and variance are related straightforwardly. Note this will not be true for MDPs with more general reward structures.

G.2 Bayesian model selection

To determine the prior that for the dynamics model with results presented in Fig. 5a, we carry out Bayesian model selection by minimising the negative log-marginal likelihood of the data with respect to the parameter αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT. To remain consistent with the limitation that only actions observed at least 5 times in the data should be employed at each state, we only use the data for such state-action transitions when determining the optimal αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT.

For each state-action, the full form of the Dirichlet prior in terms of αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is Friedman and Singer [1998]

p({θs,asj|si𝒮})=Γ(|𝒮|αp)Γ(αp)|𝒮|j(θs,asj)αp1,𝑝conditional-setsuperscriptsubscript𝜃𝑠𝑎subscript𝑠𝑗subscript𝑠𝑖𝒮Γ𝒮subscript𝛼𝑝Γsuperscriptsubscript𝛼𝑝𝒮subscriptproduct𝑗superscriptsuperscriptsubscript𝜃𝑠𝑎subscript𝑠𝑗subscript𝛼𝑝1p(\{\theta_{s,a}^{s_{j}}|s_{i}\in\mathcal{S}\})=\frac{\Gamma(|\mathcal{S}|% \alpha_{p})}{\Gamma(\alpha_{p})^{|\mathcal{S}|}}\prod_{j}(\theta_{s,a}^{s_{j}}% )^{\alpha_{p}-1},italic_p ( { italic_θ start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT | italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_S } ) = divide start_ARG roman_Γ ( | caligraphic_S | italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) end_ARG start_ARG roman_Γ ( italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT | caligraphic_S | end_POSTSUPERSCRIPT end_ARG ∏ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT , (23)

where ΓΓ\Gammaroman_Γ is the gamma function. The likelihood is

p(𝒟|θ)=j(θs,asj)nj,𝑝conditional𝒟𝜃subscriptproduct𝑗superscriptsuperscriptsubscript𝜃𝑠𝑎subscript𝑠𝑗subscript𝑛𝑗p(\mathcal{D}|\theta)=\prod_{j}(\theta_{s,a}^{s_{j}})^{n_{j}},italic_p ( caligraphic_D | italic_θ ) = ∏ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_θ start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , (24)

with njsubscript𝑛𝑗n_{j}italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT being the number of observed transitions from state-action s,a𝑠𝑎s,aitalic_s , italic_a to state sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Hence, the model evidence is

p(𝒟)𝑝𝒟\displaystyle p(\mathcal{D})italic_p ( caligraphic_D ) =𝑑θp(𝒟|θ)p(θ)absentdifferential-d𝜃𝑝conditional𝒟𝜃𝑝𝜃\displaystyle=\int d\theta p(\mathcal{D}|\theta)p(\theta)= ∫ italic_d italic_θ italic_p ( caligraphic_D | italic_θ ) italic_p ( italic_θ ) (25)
=Γ(|𝒮|αp)Γ(αp)|𝒮|jΓ(αp+nj)|𝒮|Γ(|𝒮|αp+Ns,a),absentΓ𝒮subscript𝛼𝑝Γsuperscriptsubscript𝛼𝑝𝒮subscriptproduct𝑗Γsuperscriptsubscript𝛼𝑝subscript𝑛𝑗𝒮Γ𝒮subscript𝛼𝑝subscript𝑁𝑠𝑎\displaystyle=\frac{\Gamma(|\mathcal{S}|\alpha_{p})}{\Gamma(\alpha_{p})^{|% \mathcal{S}|}}\frac{\prod_{j}\Gamma(\alpha_{p}+n_{j})^{|\mathcal{S}|}}{\Gamma(% |\mathcal{S}|\alpha_{p}+N_{s,a})},= divide start_ARG roman_Γ ( | caligraphic_S | italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) end_ARG start_ARG roman_Γ ( italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT | caligraphic_S | end_POSTSUPERSCRIPT end_ARG divide start_ARG ∏ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT roman_Γ ( italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + italic_n start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT | caligraphic_S | end_POSTSUPERSCRIPT end_ARG start_ARG roman_Γ ( | caligraphic_S | italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT + italic_N start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT ) end_ARG , (26)

with Ns,asubscript𝑁𝑠𝑎N_{s,a}italic_N start_POSTSUBSCRIPT italic_s , italic_a end_POSTSUBSCRIPT being the number of observed transitions from state-action s,a𝑠𝑎s,aitalic_s , italic_a. Since transitions are independent across state-actions, taking the negative logarithm of this quantity and summing across all state-actions results in the overall negative log-marginal likelihood for the dataset in terms of αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT. The resulting function of αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is visualised in Fig.13 and attains a minimum value at approximately αp=0.072subscript𝛼𝑝0.072\alpha_{p}=0.072italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = 0.072.

Refer to caption
Figure 13: Negative log-marginal likelihood for clinical data dynamics model against parameter αpsubscript𝛼𝑝\alpha_{p}italic_α start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT of the prior.