Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Accelerating Point-Based Value Iteration via Active Sampling of Belief Points and Gaussian Process Regression

Siqiong Zhou School of Computing and Augmented Intelligence, Arizona State University, USA Ashif S. Iquebal School of Computing and Augmented Intelligence, Arizona State University, USA Esma S. Gel College of Business, University of Nebraska-Lincoln, USA
Abstract

Partially Observable Markov Decision Processes (POMDPs) are fundamental to decision-making under uncertainty. We introduce a novel scalable approach to accelerate upper bound estimation in Point-Based Value Iteration (PBVI) algorithms, the leading method to solve large-scale POMDPs. PBVI approximates the value function using a set of belief points rather than the entire continuous belief space and relies on lower and upper bounds for convergence. While lower bounds are straightforward to compute, PVBI requires repeated sawtooth projection operations to approximate the upper bound convex hull, significantly increasing the computational burden although many of these sawtooth projections become redundant as the belief set expands. To address this, we infer the upper bound using the upper confidence bound of a Gaussian Process Regression (GP-UCB) fitted over a subset of the most informative reachable belief points–the ones that exhibit linear independence in some high-dimensional Hilbert space. This approach reduces the number of sawtooth projections by 84.3% on average without compromising the solution quality. We further establish the theoretical consistency of the proposed GP-UCB estimate of the upper bound and show convergence to the true upper bound convex hull. We implement GP-UCB and test its performance using five benchmark finite-horizon POMDPs, demonstrating its effectiveness in estimating upper bounds and improving PBVI performance. GP-UCB reduces computation time by 30% to 60% on smaller problems and up to 99.7% on larger ones, while achieving the same gaps as the pure sawtooth projection method.

Keywords:partially observable Markov decision processes, Gaussian process regression, upper confidence bound, point-based value iteration

1 Introduction

A Partially Observable Markov Decision Process (POMDP) extends the classic Markov Decision Process (MDP) by addressing situations where the state of the system is only partially observable (Kaelbling et al. 1998). In real-world applications, decision makers often lack perfect information about the current state of the system, which complicates both control and information gathering tasks. This partial observability introduces an inherent uncertainty that must be effectively managed for optimal decision making. The POMDP framework is uniquely suited to this challenge, as it allows decisions to be made based on probabilistic estimates of the state of the system, known as belief states.

The significance of solving POMDPs lies in their applicability to a wide range of complex real-world problems. POMDPs have been applied in fields such as reliability and maintenance (Cassandra 1998, Papakonstantinou and Shinozuka 2014, Kim et al. 2018, Song et al. 2022), aircraft collision avoidance (Temizer et al. 2010, Bai et al. 2012, Mueller and Kochenderfer 2016), and medical decision making (Vozikis and Goulionis 2009, Ayer et al. 2012, Zois 2016, Zhang and Wang 2022). In each of these domains, the ability to account for uncertain observations and dynamically update belief states based on new information is critical for success.

Although POMDPs offer a powerful theoretical framework, solving them exactly for large-scale problems remains computationally infeasible. This limitation arises from the curse of dimensionality, as the size of the reachable belief space grows exponentially with the number of states and observations (Papadimitriou and Tsitsiklis 1987, Madani et al. 1999). As a result, approximate methods have been developed to provide practical solutions. Among these, one of the most successful is the Point-Based Value Iteration (PBVI) algorithm (Pineau et al. 2003), which approximates the value function by focusing on a representative set of belief points rather than computing it over the entire belief space, which is a probability simplex, where each belief 𝒃𝒃\bm{b}bold_italic_b is a convex combination of the entire unit vectors corresponding to each state.

The purpose of this paper is to advance the current body of knowledge on solving finite-horizon POMDPs efficiently. In particular, we propose a novel method that accelerates the calculation of upper bounds in PBVI by using a Gaussian Process Upper Confidence Bound (GP-UCB). This method offers a data-efficient approximation strategy to improve the computational tractability of solving POMDPs in complex, large-scale environments. Our contribution is particularly significant in finite-horizon problems, where the dynamically changing value function requires repeated updates to both lower and upper bounds, the latter being computationally prohibitive to estimate. To this end, we employ Gaussian Process Regression (GPR) to learn a model of the upper bound estimates for a subset of the most informative belief points. The trained GPR model is then used to predict the entire upper bound convex hull, reducing the computational complexity while maintaining high-quality approximations. GP-UCB is proposed as a conservative estimate of the upper bound convex hull. The main contributions of our work are as follows.

  1. 1.

    The novel use of the GPR for the upper bound approximation, leveraging the non-parametric approximation capabilities of Gaussian Processes (GP)(Section 2).

  2. 2.

    The novel use of active learning for the iterative selection of the most informative belief points, referred to as support beliefs, to train the GPR. Training the GPR on this subset of the belief set reduces the computational cost from 𝒪(N3)𝒪superscript𝑁3\mathcal{O}(N^{3})caligraphic_O ( italic_N start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT ) to 𝒪(Nd2)𝒪𝑁superscript𝑑2\mathcal{O}(Nd^{2})caligraphic_O ( italic_N italic_d start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ), where N𝑁Nitalic_N is the number of belief points and dNmuch-less-than𝑑𝑁d\ll Nitalic_d ≪ italic_N is the number of support beliefs.

  3. 3.

    Theoretical guarantees, backed by formal proofs, showing that the upper bounds generated by our method converge to the true upper bound convex hull, leading to increasingly accurate approximations as the belief set expands.

Through extensive numerical experiments, we demonstrate that our method outperforms existing PBVI algorithms in both convergence speed and scalability across a range of complex problem domains and horizon lengths. Our experiments are conducted on well-established test problems, including instances with up to 90 states, 29 actions, and 3 observations. GP-UCB significantly improves computational efficiency in these settings, reducing computation time by 30%-60% on smaller problems and up to 99.7% on larger ones while maintaining the same gaps as the pure sawtooth projection method.

While POMDP formulations can involve even larger state and action spaces, our experiments exceed the scale of many Operations Research applications (e.g., in medical decision-making and maintenance planning), including recent ones, which typically consider formulations with at most 6 states, 3 actions, and 4 observations (Lin et al. 2004, Ayer et al. 2012, Liu et al. 2022, Hajjar and Alagoz 2023, Gong and Liu 2023, Li et al. 2023, Deep et al. 2023). By demonstrating strong performance on significantly larger problems than those addressed in these domains, our approach contributes to the OR literature by providing a scalable and computationally efficient tool for solving practical finite-horizon POMDPs.

Section 2 provides a description of the main problem that we consider, followed by a detailed explanation of the use of GPR to infer upper bounds in Section 3. Some readers may find the background given in Appendix A useful before proceeding with these sections since it provides an overview of POMDP solution approaches to date, with a focus on key components of state-of-the-art PBVI algorithms. We demonstrate our effectiveness claims in Section 5 through the use of an extensive set of numerical experiments on finite-horizon problems described in Section 4. The paper concludes with a summary and comments for future research directions in Section 6.

2 Problem Description

A POMDP is expressed by a tuple (𝒮,𝒜,𝒪,Θ,Ω,R,𝒃𝟎)𝒮𝒜𝒪ΘΩ𝑅subscript𝒃0({\cal S},{\cal A},{\cal O},\Theta,\Omega,R,\bm{b_{0}})( caligraphic_S , caligraphic_A , caligraphic_O , roman_Θ , roman_Ω , italic_R , bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT ), where an agent occupies one of the possible states of the system s𝒮𝑠𝒮s\in\cal Sitalic_s ∈ caligraphic_S, which cannot be observed directly. The system changes from one state to the next after the agent takes an action a𝒜𝑎𝒜a\in\cal Aitalic_a ∈ caligraphic_A. Function Θ:𝒮×𝒜×𝒮[0,1]:Θ𝒮𝒜𝒮01{\Theta:{\cal S}\times{\cal A}\times{\cal S}\rightarrow\left[0,1\right]}roman_Θ : caligraphic_S × caligraphic_A × caligraphic_S → [ 0 , 1 ] represents the stochastic state transitions. Specifically, Θ(s,a,s)=P(s|s,a)Θ𝑠𝑎superscript𝑠𝑃conditionalsuperscript𝑠𝑠𝑎{\Theta(s,a,s^{\prime})=P(s^{\prime}|s,a)}roman_Θ ( italic_s , italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) denotes the probability transition function of state changes from s𝒮𝑠𝒮s\in\cal Sitalic_s ∈ caligraphic_S to s𝒮superscript𝑠𝒮s^{\prime}\in\cal Sitalic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S after performing the corresponding action a𝒜𝑎𝒜a\in\cal Aitalic_a ∈ caligraphic_A. Since the states are not observable directly, the agent makes certain observations o𝒪𝑜𝒪o\in\cal Oitalic_o ∈ caligraphic_O, which are imperfect projections of the states. The observation probability is defined by the function Ω:𝒜×𝒮×𝒪[0,1]:Ω𝒜𝒮𝒪01\Omega:{\cal A}\times{\cal S}\times{\cal O}\rightarrow\left[0,1\right]roman_Ω : caligraphic_A × caligraphic_S × caligraphic_O → [ 0 , 1 ]. Specifically, Ω(a,s,o)=P(o|a,s)Ω𝑎superscript𝑠𝑜𝑃conditional𝑜𝑎superscript𝑠\Omega(a,s^{\prime},o)=P(o|a,s^{\prime})roman_Ω ( italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_o ) = italic_P ( italic_o | italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) is the probability of observing o𝒪𝑜𝒪o\in\cal Oitalic_o ∈ caligraphic_O after taking action a𝒜𝑎𝒜a\in\cal Aitalic_a ∈ caligraphic_A, under the true state s𝒮superscript𝑠𝒮s^{\prime}\in\cal{S}italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S. In addition, the agent’s action results in rewards. Function R𝒮×𝒜𝑅𝒮𝒜R\in{\cal S}\times{\cal A}\rightarrow\mathbb{R}italic_R ∈ caligraphic_S × caligraphic_A → blackboard_R denotes the reward function. Reward R(s,a)𝑅𝑠𝑎R(s,a)italic_R ( italic_s , italic_a ) is received after taking action a𝒜𝑎𝒜a\in\cal Aitalic_a ∈ caligraphic_A in state s𝒮𝑠𝒮s\in\cal Sitalic_s ∈ caligraphic_S. The planning horizon T𝑇Titalic_T specifies the finite time steps during which the agent seeks to maximize its cumulative reward. Let vector 𝒃=(b(1),b(2),,b(|𝒮|))𝒃𝑏1𝑏2𝑏𝒮{\bm{b}=(b(1),b(2),\dots,b(|\cal S|))}bold_italic_b = ( italic_b ( 1 ) , italic_b ( 2 ) , … , italic_b ( | caligraphic_S | ) ) denote the belief state where b(s)𝑏𝑠b(s)italic_b ( italic_s ) is the probability that the true system state is s𝒮𝑠𝒮s\in{\cal S}italic_s ∈ caligraphic_S. Starting from belief 𝒃𝒃\bm{b}bold_italic_b at time t𝑡titalic_t, the agent updates belief to 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT at time t+1𝑡1t+1italic_t + 1 after executing action a𝑎aitalic_a and observing o𝑜oitalic_o as

b(s)=P(o|a,s)P(o|b,a)s𝒮P(s|s,a)b(s)=P(o|a,s)s𝒮P(s|s,a)b(s)s𝒮P(o|a,s)s𝒮P(s|s,a)b(s),s𝒮.formulae-sequencesuperscript𝑏superscript𝑠𝑃conditional𝑜𝑎superscript𝑠𝑃conditional𝑜𝑏𝑎subscript𝑠𝒮𝑃conditionalsuperscript𝑠𝑠𝑎𝑏𝑠𝑃conditional𝑜𝑎superscript𝑠subscript𝑠𝒮𝑃conditionalsuperscript𝑠𝑠𝑎𝑏𝑠subscriptsuperscript𝑠𝒮𝑃conditional𝑜𝑎superscript𝑠subscript𝑠𝒮𝑃conditionalsuperscript𝑠𝑠𝑎𝑏𝑠for-all𝑠𝒮\displaystyle b^{\prime}(s^{\prime})=\frac{P(o|a,s^{\prime})}{P(o|b,a)}\sum_{s% \in{\cal S}}P(s^{\prime}|s,a)b(s)=\frac{P(o|a,s^{\prime})\sum_{s\in{\cal S}}P(% s^{\prime}|s,a)b(s)}{\sum_{s^{\prime}\in{\cal S}}P(o|a,s^{\prime})\sum_{s\in{% \cal S}}P(s^{\prime}|s,a)b(s)}~{},\forall s\in\cal S~{}.italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = divide start_ARG italic_P ( italic_o | italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) end_ARG start_ARG italic_P ( italic_o | italic_b , italic_a ) end_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) italic_b ( italic_s ) = divide start_ARG italic_P ( italic_o | italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) italic_b ( italic_s ) end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT italic_P ( italic_o | italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) italic_b ( italic_s ) end_ARG , ∀ italic_s ∈ caligraphic_S . (2.1)

Naturally, s𝒮b(s)=1subscript𝑠𝒮𝑏𝑠1\sum_{s\in\cal S}b(s)=1∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) = 1. The initial belief, 𝒃𝟎subscript𝒃0\bm{b_{0}}bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT, provides the probability distribution of the state at the beginning of the planning horizon. The beliefs derived from an initial belief, via a feasible sequence of actions and observations are called reachable belief points.

Figure 1 shows the belief states that are reachable from the initial belief state, 𝒃𝟎=(0.5,0.5)subscript𝒃00.50.5{\bm{b_{0}}=(0.5,0.5)}bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT = ( 0.5 , 0.5 ) in one stage for the well-known tiger problem presented by Kaelbling et al. (1998), considering three actions, Listen (a1subscript𝑎1a_{1}italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT), Open Left Door (a2subscript𝑎2a_{2}italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT), Open Right Door (a3subscript𝑎3a_{3}italic_a start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT), and two observations resulting from each action, under the assumption that opening a (left or right) door restarts the problem and resets the belief to (0.5,0.5)0.50.5(0.5,0.5)( 0.5 , 0.5 ). Appendix B contains a detailed description of the tiger problem.

Refer to caption
Figure 1: Tree structure of a two-stage tiger problem (Kaelbling et al. 1998). Circles represent belief states observed right before taking an action in each stage.

A policy π𝜋\piitalic_π provides the sequence of actions to be taken over the planning horizon as a function of the belief state. The value function represents the expected cumulative reward obtained under the optimal policy, πsuperscript𝜋\pi^{*}italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. For a finite horizon problem, the following backward recursion computes the value function, Vt(𝒃)subscript𝑉𝑡𝒃V_{t}(\bm{b})italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_b ), as,

Vt(𝒃)=maxa𝒜[s𝒮b(s)R(s,a)+o𝒪P(o|𝒃,a)Vt+1π(𝒃)],𝒃𝓉,𝓉{0,1,,𝒯1},formulae-sequencesubscript𝑉𝑡𝒃𝑎𝒜delimited-[]subscript𝑠𝒮𝑏𝑠𝑅𝑠𝑎subscript𝑜𝒪𝑃conditional𝑜𝒃𝑎superscriptsubscript𝑉𝑡1superscript𝜋superscript𝒃bold-′formulae-sequencefor-all𝒃subscript𝓉𝓉01𝒯1\displaystyle V_{t}(\bm{b})=\underset{a\in{\cal A}}{\max}\left[\sum_{s\in{\cal S% }}b(s)R(s,a)+\sum_{o\in{\cal O}}P(o|\bm{b},a)V_{t+1}^{\pi^{*}}(\bm{b^{\prime}}% )\right],\forall~{}\bm{b}\in\cal{B}_{t},~{}t\in\{0,1,\dots,T-1\}~{},italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_b ) = start_UNDERACCENT italic_a ∈ caligraphic_A end_UNDERACCENT start_ARG roman_max end_ARG [ ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) italic_R ( italic_s , italic_a ) + ∑ start_POSTSUBSCRIPT italic_o ∈ caligraphic_O end_POSTSUBSCRIPT italic_P ( italic_o | bold_italic_b , italic_a ) italic_V start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ( bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT ) ] , ∀ bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT , caligraphic_t ∈ { caligraphic_0 , caligraphic_1 , … , caligraphic_T - caligraphic_1 } , (2.2)

where 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT denotes the |𝒮|𝒮|\cal S|| caligraphic_S |-dimensional belief space. The optimal policy for belief state 𝒃𝒃\bm{b}bold_italic_b at time period t𝑡titalic_t is defined as

πt(𝒃)=argmaxa𝒜[s𝒮b(s)R(s,a)+o𝒪P(o|𝒃,a)Vt+1(𝒃)].superscriptsubscript𝜋𝑡𝒃𝑎𝒜delimited-[]subscript𝑠𝒮𝑏𝑠𝑅𝑠𝑎subscript𝑜𝒪𝑃conditional𝑜𝒃𝑎subscript𝑉𝑡1superscript𝒃bold-′\displaystyle\pi_{t}^{*}(\bm{b})=\underset{a\in{\cal A}}{\arg\max}\left[\sum_{% s\in{\cal S}}b(s)R(s,a)+\sum_{o\in{\cal O}}P(o|\bm{b},a)V_{t+1}(\bm{b^{\prime}% })\right]~{}.italic_π start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( bold_italic_b ) = start_UNDERACCENT italic_a ∈ caligraphic_A end_UNDERACCENT start_ARG roman_arg roman_max end_ARG [ ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) italic_R ( italic_s , italic_a ) + ∑ start_POSTSUBSCRIPT italic_o ∈ caligraphic_O end_POSTSUBSCRIPT italic_P ( italic_o | bold_italic_b , italic_a ) italic_V start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT ) ] . (2.3)

Sondik (1971) showed that the value function is represented exactly by a piecewise-linear and convex function such that the value function for a specific period is represented by a set of |𝒮|𝒮|\cal S|| caligraphic_S |-dimensional α𝛼\alphaitalic_α-vectors, Γt={α0,α1,}subscriptΓ𝑡superscript𝛼0superscript𝛼1\Gamma_{t}=\{\alpha^{0},\alpha^{1},...\}roman_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { italic_α start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT , italic_α start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT , … }. That is,

Vt(𝒃)=maxαΓts𝒮b(s)α(s),t{0,1,,T1}.formulae-sequencesubscript𝑉𝑡𝒃𝛼subscriptΓ𝑡subscript𝑠𝒮𝑏𝑠𝛼𝑠𝑡01𝑇1\displaystyle V_{t}(\bm{b})=\underset{\alpha\in\Gamma_{t}}{\max}\sum_{s\in{% \cal S}}b(s)\alpha(s)~{},~{}t\in\{0,1,\dots,T-1\}~{}.italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_b ) = start_UNDERACCENT italic_α ∈ roman_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_UNDERACCENT start_ARG roman_max end_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) italic_α ( italic_s ) , italic_t ∈ { 0 , 1 , … , italic_T - 1 } . (2.4)

Hence, the value function given in Eqn. (2.2) can be rewritten as

Vt(𝒃)=maxa𝒜[s𝒮b(s)R(s,a)+o𝒪P(o|𝒃,a)maxαt+1Γt+1(s𝒮b(s)αt+1(s))].subscript𝑉𝑡𝒃𝑎𝒜delimited-[]subscript𝑠𝒮𝑏𝑠𝑅𝑠𝑎subscript𝑜𝒪𝑃conditional𝑜𝒃𝑎subscript𝛼𝑡1subscriptΓ𝑡1subscriptsuperscript𝑠𝒮superscript𝑏superscript𝑠subscript𝛼𝑡1superscript𝑠\displaystyle V_{t}(\bm{b})=\underset{a\in{\cal A}}{\max}\left[\sum_{s\in{\cal S% }}b(s)R(s,a)+\sum_{o\in{\cal O}}P(o|\bm{b},a)\underset{\alpha_{t+1}\in\Gamma_{% t+1}}{\max}\left(\sum_{s^{\prime}\in{\cal S}}b^{\prime}(s^{\prime})\alpha_{t+1% }(s^{\prime})\right)\right]~{}.italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_b ) = start_UNDERACCENT italic_a ∈ caligraphic_A end_UNDERACCENT start_ARG roman_max end_ARG [ ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) italic_R ( italic_s , italic_a ) + ∑ start_POSTSUBSCRIPT italic_o ∈ caligraphic_O end_POSTSUBSCRIPT italic_P ( italic_o | bold_italic_b , italic_a ) start_UNDERACCENT italic_α start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ∈ roman_Γ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_UNDERACCENT start_ARG roman_max end_ARG ( ∑ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_α start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ) ] . (2.5)

This representation allows the value function to be computed recursively using a set of α𝛼\alphaitalic_α-vectors. At each step, the backup operation updates the set of α𝛼\alphaitalic_α-vectors, ΓtsubscriptΓ𝑡\Gamma_{t}roman_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT based on the current belief state 𝒃𝒃\bm{b}bold_italic_b. Specifically, ΓtsubscriptΓ𝑡\Gamma_{t}roman_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is updated with

Γt=𝒃tBackup(𝒃,t),subscriptΓ𝑡subscript𝒃subscript𝑡Backup𝒃𝑡\displaystyle\Gamma_{t}=\bigcup_{\bm{b}\in\mathcal{B}_{t}}\text{Backup}(\bm{b}% ,t)~{},roman_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = ⋃ start_POSTSUBSCRIPT bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT Backup ( bold_italic_b , italic_t ) , (2.6)

where the backup operation for each belief 𝒃𝒃\bm{b}bold_italic_b is given by

Backup(𝒃,t)=argmaxgab,a𝒜[𝒃gab],Backup𝒃𝑡superscriptsubscript𝑔𝑎𝑏for-all𝑎𝒜delimited-[]𝒃superscriptsubscript𝑔𝑎𝑏\displaystyle\text{Backup}(\bm{b},t)=\underset{g_{a}^{b},\forall a\in{\cal A}}% {\arg\max}\left[{\bm{b}\cdot g_{a}^{b}}\right]~{},Backup ( bold_italic_b , italic_t ) = start_UNDERACCENT italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT , ∀ italic_a ∈ caligraphic_A end_UNDERACCENT start_ARG roman_arg roman_max end_ARG [ bold_italic_b ⋅ italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ] , (2.7)

where gabsuperscriptsubscript𝑔𝑎𝑏g_{a}^{b}italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT for action a𝑎aitalic_a is computed as

gab(s)={R(s,a)+o𝒪argmaxαt+1Γt+1s𝒮b(s)s𝒮P(s|s,a)P(o|a,s)αt+1(s),t<T,R(s,a),t=T.superscriptsubscript𝑔𝑎𝑏𝑠cases𝑅𝑠𝑎subscript𝑜𝒪subscript𝛼𝑡1subscriptΓ𝑡1subscript𝑠𝒮𝑏𝑠subscriptsuperscript𝑠𝒮𝑃conditionalsuperscript𝑠𝑠𝑎𝑃conditional𝑜𝑎superscript𝑠subscript𝛼𝑡1superscript𝑠𝑡𝑇𝑅𝑠𝑎𝑡𝑇\displaystyle g_{a}^{b}(s)=\begin{cases}R(s,a)+\sum_{o\in\mathcal{O}}\underset% {\alpha_{t+1}\in\Gamma_{t+1}}{\arg\max}\sum_{s\in\mathcal{S}}b(s)\sum_{s^{% \prime}\in\mathcal{S}}P(s^{\prime}|s,a)P(o|a,s^{\prime})\alpha_{t+1}(s^{\prime% }),&t<T~{},\\ R(s,a),&t=T~{}.\end{cases}italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT ( italic_s ) = { start_ROW start_CELL italic_R ( italic_s , italic_a ) + ∑ start_POSTSUBSCRIPT italic_o ∈ caligraphic_O end_POSTSUBSCRIPT start_UNDERACCENT italic_α start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ∈ roman_Γ start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_UNDERACCENT start_ARG roman_arg roman_max end_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) ∑ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_s , italic_a ) italic_P ( italic_o | italic_a , italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) italic_α start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) , end_CELL start_CELL italic_t < italic_T , end_CELL end_ROW start_ROW start_CELL italic_R ( italic_s , italic_a ) , end_CELL start_CELL italic_t = italic_T . end_CELL end_ROW (2.8)

PBVI Algorithm:

A key idea of PBVI algorithms is sampling a representative set of belief points to approximate the value function. For these sampled belief points, both a lower bound Vt¯¯subscript𝑉𝑡\underline{V_{t}}under¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG and an upper bound Vt¯¯subscript𝑉𝑡\overline{V_{t}}over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG of the value function are maintained. The gap between the lower and upper bounds at reachable belief points affects the value function approximation. Thus, the PBVI algorithm implemented by Walraven and Spaan (2019) selects the belief points with the largest gap, focusing on areas that can most effectively improve the value function approximation. The approach prioritizes the sampling of belief points where the gap is maximum in the time step t+1𝑡1t+1italic_t + 1.

Algorithm 1 outlines the PBVI algorithm incorporating lower and upper bound calculations. Following Walraven and Spaan (2019), the initial belief set includes corner beliefs, which are the belief points where all probability mass is concentrated on a single state. Mathematically, these are the unit vectors 𝒘ssubscript𝒘𝑠\bm{w}_{s}bold_italic_w start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, where 𝒘ssubscript𝒘𝑠\bm{w}_{s}bold_italic_w start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT is a vector with 1111 in the sthsuperscript𝑠𝑡s^{th}italic_s start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT entry and 00 elsewhere. These points define the boundaries of the belief space, and hence are useful for initializing the PBVI and defining the reachable belief set.

Algorithm 1 Point-based Value Iteration (PBVI) Algorithm, generalized from Walraven and Spaan (2019), Spaan and Vlassis (2005), Lovejoy (1991a)
1:Input: POMDP model, initial belief point 𝒃𝟎subscript𝒃0\bm{b_{0}}bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT, initial belief set 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT containing all corner beliefs for t{0,1,,T1}𝑡01𝑇1t\in\{0,1,\dots,T-1\}italic_t ∈ { 0 , 1 , … , italic_T - 1 }, and convergence threshold, ϵitalic-ϵ\epsilonitalic_ϵ
2:Output: Lower bound Vt¯(𝒃)¯subscript𝑉𝑡𝒃\underline{V_{t}}(\bm{b})under¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) and upper bound Vt¯(𝒃)¯subscript𝑉𝑡𝒃\overline{V_{t}}(\bm{b})over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) of value function for 𝒃𝓉for-all𝒃subscript𝓉\forall\bm{b}\in\cal{B}_{t}∀ bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT for t{0,1,,T1}𝑡01𝑇1t\in\{0,1,\dots,T-1\}italic_t ∈ { 0 , 1 , … , italic_T - 1 }
3:Initialize Vt¯(𝒃)¯subscript𝑉𝑡𝒃\underline{V_{t}}(\bm{b})under¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) and Vt¯(𝒃)¯subscript𝑉𝑡𝒃\overline{V_{t}}(\bm{b})over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) for 𝒃𝓉for-all𝒃subscript𝓉\forall\bm{b}\in\cal{B}_{t}∀ bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT for t{0,1,,T1}𝑡01𝑇1t\in\{0,1,\dots,T-1\}italic_t ∈ { 0 , 1 , … , italic_T - 1 } using Eqn. (2.5) and Eqn. (C.3)
4:while V0¯(𝒃𝟎)V0¯(𝒃𝟎)>ϵ¯subscript𝑉0subscript𝒃0¯subscript𝑉0subscript𝒃0italic-ϵ\overline{V_{0}}(\bm{b_{0}})-\underline{V_{0}}(\bm{b_{0}})>\epsilonover¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT ) - under¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT ) > italic_ϵ  do
5:    Choose a sampling method: Max-Gap, Random or Fixed-Grid
6:    if Sampling method = “Max-Gap” then
7:         Set 𝒃=𝒃𝟎𝒃subscript𝒃0\bm{b}=\bm{b_{0}}bold_italic_b = bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT
8:         for t=0𝑡0t=0italic_t = 0 to T2𝑇2T-2italic_T - 2 do
9:             Choose the action aargmaxa𝒜(R(a)𝒃+o𝒪P(o|𝒃,a)Vt+1¯(𝒃))𝑎subscript𝑎𝒜𝑅𝑎𝒃subscript𝑜𝒪𝑃conditional𝑜𝒃𝑎¯subscript𝑉𝑡1superscript𝒃bold-′a\leftarrow\arg\max_{a\in{\cal A}}\left(R(a)\cdot\bm{b}+\sum_{o\in{\cal O}}P(o% |\bm{b},a)\cdot\overline{V_{t+1}}(\bm{b^{\prime}})\right)italic_a ← roman_arg roman_max start_POSTSUBSCRIPT italic_a ∈ caligraphic_A end_POSTSUBSCRIPT ( italic_R ( italic_a ) ⋅ bold_italic_b + ∑ start_POSTSUBSCRIPT italic_o ∈ caligraphic_O end_POSTSUBSCRIPT italic_P ( italic_o | bold_italic_b , italic_a ) ⋅ over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT ) )
10:             Choose the observation oargmaxo𝒪(Vt+1¯(𝒃)Vt+1¯(𝒃))𝑜subscript𝑜𝒪¯subscript𝑉𝑡1𝒃¯subscript𝑉𝑡1𝒃o\leftarrow\arg\max_{o\in{\cal O}}\left(\overline{V_{t+1}}(\bm{b})-\underline{% V_{t+1}}(\bm{b})\right)italic_o ← roman_arg roman_max start_POSTSUBSCRIPT italic_o ∈ caligraphic_O end_POSTSUBSCRIPT ( over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) - under¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) )
11:             Sample new belief, 𝒃newsubscript𝒃new\bm{b_{\text{new}}}bold_italic_b start_POSTSUBSCRIPT new end_POSTSUBSCRIPT using the selected a𝑎aitalic_a, o𝑜oitalic_o with Eqn. (2.1) and add to 𝓉+1subscript𝓉1\cal{B}_{t+1}caligraphic_B start_POSTSUBSCRIPT caligraphic_t + caligraphic_1 end_POSTSUBSCRIPT
12:             Set 𝒃=𝒃new𝒃subscript𝒃new\bm{b}=\bm{b_{\text{new}}}bold_italic_b = bold_italic_b start_POSTSUBSCRIPT new end_POSTSUBSCRIPT
13:         end for      (The max-gap method from Walraven and Spaan (2019))
14:    else if Sampling method = “Random” then
15:         for t=1𝑡1t=1italic_t = 1 to T1𝑇1T-1italic_T - 1 do
16:             Sample new belief, 𝒃newsubscript𝒃new\bm{b_{\text{new}}}bold_italic_b start_POSTSUBSCRIPT new end_POSTSUBSCRIPT, uniformly from belief space and add to 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT, e.g., random sampling method proposed by Spaan and Vlassis (2005)
17:         end for
18:    else if Sampling method = “Fixed-Grid” then
19:         Use fixed belief grid for 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT for t{0,1,,T1}𝑡01𝑇1t\in\{0,1,\dots,T-1\}italic_t ∈ { 0 , 1 , … , italic_T - 1 }, e.g., fixed grid proposed by Lovejoy (1991a)
20:    end if
21:    for t=T1𝑡𝑇1t=T-1italic_t = italic_T - 1 to 00 do
22:         Update Vt¯(𝒃)¯subscript𝑉𝑡𝒃\underline{V_{t}}(\bm{b})under¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) for 𝒃𝓉for-all𝒃subscript𝓉\forall\bm{b}\in\cal{B}_{t}∀ bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT using Eqn. (2.5)
23:         Update Vt¯(𝒃)¯subscript𝑉𝑡𝒃\overline{V_{t}}(\bm{b})over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) for 𝒃𝓉for-all𝒃subscript𝓉\forall\bm{b}\in\cal{B}_{t}∀ bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT with Eqn. (C.3) using an approximation method, e.g., sawtooth
24:         Prune dominated α𝛼\alphaitalic_α-vectors
25:    end for
26:end while

We now discuss the computation of Vt¯(𝒃)¯subscript𝑉𝑡𝒃\underline{V_{t}}(\bm{b})under¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) and Vt¯(𝒃)¯subscript𝑉𝑡𝒃\overline{V_{t}}(\bm{b})over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) for a belief point, 𝒃𝓉𝒃subscript𝓉\bm{b}\in\cal{B}_{t}bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT. While the former can be easily obtained by taking the best α𝛼\alphaitalic_α-vector at any belief point (Hauskrecht 2000), obtaining the upper bound is computationally more challenging. In finite-horizon POMDPs, the challenge is even greater because the value function evolves dynamically over time (Smallwood and Sondik 1973). Unlike in infinite-horizon settings—where a stationary policy optimizes the value function indefinitely—finite-horizon problems require recomputing the value function at each time step, as the number of remaining decisions directly influences both immediate and future rewards (Pineau et al. 2003). This makes the calculation of upper bounds particularly expensive, since upper bounds must be recomputed at each time step to reflect the evolving decision horizon. Determining an exact upper bound requires optimistically evaluating all possible future outcomes, significantly increasing computational cost (Walraven and Spaan 2019, Smith and Simmons 2005).

Lovejoy (1991a) presented some of the earliest implementation of upper bounds using linear interpolation method using a grid of belief points obtained via Freudenthal triangulation. Subsequently, Hauskrecht (2000) presented an upper bound based on the convex hull projection while considering the upper bound improvement with incremental addition of new belief points. Since the value function is piecewise convex, the upper bound for a new belief point is obtained by projecting the belief point to the convex hull obtained by the existing belief-upper bound value pairs. For the remainder of the paper, we refer to this as convex hull. Linear programming is used to identify the best convex hull by minimizing the linear combination of the existing upper bounds. Given the repetitive nature of updating upper bounds, the computational cost of executing a large number of linear programming steps quickly becomes intractable.

Hauskrecht (2000) proposed an interpolation approach for efficient approximation of the convex hull, referred to as sawtooth, to reduce computational complexity. Given an arbitrary belief 𝒃𝒃\bm{b}\in\cal{B}bold_italic_b ∈ caligraphic_B and corner beliefs, the sawtooth projection v¯()¯𝑣\bar{v}(\cdot)over¯ start_ARG italic_v end_ARG ( ⋅ ) provides the upper bound approximation for any new belief point. The detailed description of the sawtooth projection algorithm can be found in Appendix C.

To further demonstrate the sawtooth projection, consider the tiger problem again. In each subfigure of Figure 2, the orange dots represent belief points in the current belief set \cal{B}caligraphic_B with known upper bounds. In the tiger problem, corner beliefs correspond to states where the tiger is surely behind either the left or the right door (i.e., belief states (1.00,0.00)1.000.00(1.00,0.00)( 1.00 , 0.00 ) or (0.00,1.00)0.001.00(0.00,1.00)( 0.00 , 1.00 )). Connecting non-corner orange belief points with these corner beliefs forms downward-pointing triangles, illustrated by the orange dashed lines. When a new belief 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT outside \cal{B}caligraphic_B is encountered, its v¯(𝒃)¯𝑣superscript𝒃bold-′\bar{v}(\bm{b^{\prime}})over¯ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT ) is estimated by projecting it onto these connecting lines, selecting the projection with the smallest value (marked with red dots). For example, when updating the upper bound for belief points at t=3𝑡3t=3italic_t = 3, such as (0.5,0.5)0.50.5(0.5,0.5)( 0.5 , 0.5 ), Eqn. (C.3) is applied, requiring v¯()¯𝑣\bar{v}(\cdot)over¯ start_ARG italic_v end_ARG ( ⋅ ) for reachable beliefs at the subsequent stage, t=4𝑡4t=4italic_t = 4. In this case, the reachable belief points are (0.5,0.5)0.50.5(0.5,0.5)( 0.5 , 0.5 ), (0.15,0.85)0.150.85(0.15,0.85)( 0.15 , 0.85 ), and (0.85,0.15)0.850.15(0.85,0.15)( 0.85 , 0.15 ). Figure 2(a) shows that (0.15,0.85)0.150.85(0.15,0.85)( 0.15 , 0.85 ) is not part of the current belief set, so v¯[(0.15,0.85)]¯𝑣delimited-[]0.150.85\bar{v}[(0.15,0.85)]over¯ start_ARG italic_v end_ARG [ ( 0.15 , 0.85 ) ] is approximated using a sawtooth projection. By projecting (0.15,0.85)0.150.85(0.15,0.85)( 0.15 , 0.85 ) onto the dashed orange lines, the smallest projection value is selected, shown as the red point in Figure 2(a). For other reachable belief points whose upper bounds are known (orange points), the values are used as v¯()¯𝑣\bar{v}(\cdot)over¯ start_ARG italic_v end_ARG ( ⋅ ) directly. This process repeats as we move back to t=2𝑡2t=2italic_t = 2, t=1𝑡1t=1italic_t = 1, and t=0𝑡0t=0italic_t = 0, applying sawtooth projections to any reachable points with unknown upper bounds, as depicted by the red points in Figures 2(b)-(d).

Refer to caption
Figure 2: Approximation of the upper bound using sawtooth projection across different time stages at iteration 2 in the tiger problem.

The sawtooth projection method offers a trade-off between computational efficiency and accuracy. Using geometric properties, it efficiently approximates the convex hull, avoiding the extensive recalculations required by heuristic-based approaches such as HSVI (Smith and Simmons 2005) and SARSOP (Kurniawati et al. 2008). However, the sawtooth projection still presents significant computational complexity due to the repeated updates required to refine the upper bound each time a new belief point is added to the belief set. Initially, expanding the belief set results in significant improvements in the upper bound. But as the belief set grows, the improvement diminishes, although the number of updates, and consequently the computational complexity, continues to increase (Hauskrecht 2000).

3 Upper Bound Prediction using Gaussian Process Regression

Here, we introduce an alternative approach that approximates the upper bound convex hull by fitting a GPR to the sawtooth projections of only a subset of belief points that we refer to as the support beliefs. As noted in Hauskrecht (2000), the change in upper bounds as the belief set expands is only marginal. As a result, an “informative” subset of the belief set is typically sufficient to model the convex hull.

Recall that under the PBVI algorithm presented by Walraven and Spaan (2019), at any stage of upper bound updates for a set of belief points, we execute the backup operation as shown in Eqn. (C.3), which involves computing the sawtooth projection for all the reachable beliefs. Instead, we compute the sawtooth projection only for the support beliefs and predict for the rest of the belief points using a GPR. The support belief set is initialized with |𝒮|+1𝒮1|\mathcal{S}|+1| caligraphic_S | + 1 random belief points and iteratively expanded using an approximate linear dependence criterion discussed below. Finally, to ensure that our approximation provides a conservative estimate of the convex hull, we consider the upper confidence bound of the GPR during the inference stage.

GPR learns an interpolation of the convex hull using a set of m𝑚mitalic_m arbitrary reachable belief points and the corresponding sawtooth projections {𝒃i,v¯}i=1msuperscriptsubscriptsubscript𝒃𝑖¯𝑣𝑖1𝑚\{\bm{b}_{i},\overline{v}\}_{i=1}^{m}{ bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , over¯ start_ARG italic_v end_ARG } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. In this framework, the sawtooth projections v¯(𝒃i)¯𝑣subscript𝒃𝑖\overline{v}(\bm{b}_{i})over¯ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) represent the noisy estimates of the convex hull, h(𝒃i)subscript𝒃𝑖h(\bm{b}_{i})italic_h ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), such that v¯(𝒃i)=h(𝒃i)+ϵi¯𝑣subscript𝒃𝑖subscript𝒃𝑖subscriptitalic-ϵ𝑖\overline{v}(\bm{b}_{i})=h(\bm{b}_{i})+\epsilon_{i}over¯ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_h ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, where ϵi𝒩(𝟎,σϵ2)similar-tosubscriptitalic-ϵ𝑖𝒩0subscriptsuperscript𝜎2italic-ϵ{\epsilon_{i}\sim\mathcal{N}(\mathbf{0},\sigma^{2}_{\epsilon})}italic_ϵ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ caligraphic_N ( bold_0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT ) is assumed to be normally distributed noise with variance σϵ2subscriptsuperscript𝜎2italic-ϵ\sigma^{2}_{\epsilon}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT, representing the variability in the overestimation of the sawtooth projections. Hence, the sawtooth projection of the training set follows a multivariate Gaussian distribution, i.e., 𝒗¯m[v¯(𝒃1),,v¯(𝒃m)]T𝒩(𝟎,𝐊(𝑩m,𝑩m)+σϵ2𝐈)subscript¯𝒗𝑚superscript¯𝑣subscript𝒃1¯𝑣subscript𝒃𝑚𝑇similar-to𝒩0𝐊subscript𝑩𝑚subscript𝑩𝑚subscriptsuperscript𝜎2italic-ϵ𝐈\overline{\bm{v}}_{m}\equiv[\overline{v}(\bm{b}_{1}),\ldots,\overline{v}(\bm{b% }_{m})]^{T}\sim\mathcal{N}\left(\mathbf{0},\mathbf{K}(\bm{B}_{m},\bm{B}_{m})+% \sigma^{2}_{\epsilon}\mathbf{I}\right)over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ≡ [ over¯ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , over¯ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∼ caligraphic_N ( bold_0 , bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT bold_I ), where 𝑩m={𝒃i}i=1msubscript𝑩𝑚superscriptsubscriptsubscript𝒃𝑖𝑖1𝑚\bm{B}_{m}={\{\bm{b}_{i}\}_{i=1}^{m}}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = { bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT is the support belief set and 𝐊(𝑩m,𝑩m)𝐊subscript𝑩𝑚subscript𝑩𝑚\mathbf{K}(\bm{B}_{m},\bm{B}_{m})bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) denotes the covariance matrix. For any new belief point 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, the joint distribution of known upper bounds 𝒗¯msubscript¯𝒗𝑚\overline{\bm{v}}_{m}over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and estimated upper bound v^(𝒃k)^𝑣subscript𝒃𝑘\widehat{v}(\bm{b}_{k})over^ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) is Gaussian, i.e.

[𝒗¯mv^(𝒃k)]𝒩(𝟎,[𝐊(𝑩m,𝑩m)+σϵ2𝐈𝐊(𝑩m,𝒃k)𝐊(𝒃k,𝑩m)𝐊(𝒃k,𝒃k)])similar-tomatrixsubscript¯𝒗𝑚^𝑣subscript𝒃𝑘𝒩0matrix𝐊subscript𝑩𝑚subscript𝑩𝑚subscriptsuperscript𝜎2italic-ϵ𝐈𝐊subscript𝑩𝑚subscript𝒃𝑘𝐊subscript𝒃𝑘subscript𝑩𝑚𝐊subscript𝒃𝑘subscript𝒃𝑘\displaystyle\begin{bmatrix}\overline{\bm{v}}_{m}\\ \widehat{v}(\bm{b}_{k})\end{bmatrix}\sim\mathcal{N}\left(\mathbf{0},\begin{% bmatrix}\mathbf{K}(\bm{B}_{m},\bm{B}_{m})+\sigma^{2}_{\epsilon}\mathbf{I}&% \mathbf{K}(\bm{B}_{m},\bm{b}_{k})\\ \mathbf{K}(\bm{b}_{k},\bm{B}_{m})&\mathbf{K}(\bm{b}_{k},\bm{b}_{k})\end{% bmatrix}\right)[ start_ARG start_ROW start_CELL over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL over^ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] ∼ caligraphic_N ( bold_0 , [ start_ARG start_ROW start_CELL bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT bold_I end_CELL start_CELL bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL bold_K ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) end_CELL start_CELL bold_K ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_CELL end_ROW end_ARG ] ) (3.1)

where 𝒗¯msubscript¯𝒗𝑚\overline{\bm{v}}_{m}over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT represents the upper bound at the current belief set 𝑩msubscript𝑩𝑚\bm{B}_{m}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, and v^(𝒃k)^𝑣subscript𝒃𝑘\widehat{v}(\bm{b}_{k})over^ start_ARG italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) is the estimated upper bound at the new belief point 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.Conditioning on the joint distribution yields the predicted mean value μ(𝒃k)𝜇subscript𝒃𝑘\mu(\bm{b}_{k})italic_μ ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) of the convex hull at 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT with the corresponding variance σ2(𝒃k)superscript𝜎2subscript𝒃𝑘\sigma^{2}(\bm{b}_{k})italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) at a new belief point 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT as

μ(𝒃k)𝜇subscript𝒃𝑘\displaystyle\mu(\bm{b}_{k})italic_μ ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) =𝐊(𝑩m,𝒃k)T[𝐊(𝑩m,𝑩m)+σϵ2𝐈]1𝒗¯m,absent𝐊superscriptsubscript𝑩𝑚subscript𝒃𝑘𝑇superscriptdelimited-[]𝐊subscript𝑩𝑚subscript𝑩𝑚subscriptsuperscript𝜎2italic-ϵ𝐈1subscript¯𝒗𝑚\displaystyle=\mathbf{K}(\bm{B}_{m},\bm{b}_{k})^{T}\left[\mathbf{K}(\bm{B}_{m}% ,\bm{B}_{m})+\sigma^{2}_{\epsilon}\mathbf{I}\right]^{-1}\overline{\bm{v}}_{m}~% {},= bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT [ bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT bold_I ] start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , (3.2)
σ2(𝒃k)superscript𝜎2subscript𝒃𝑘\displaystyle\sigma^{2}(\bm{b}_{k})italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) =𝐊(𝒃k,𝒃k)𝐊(𝑩m,𝒃k)T[𝐊(𝑩m,𝑩m)+σϵ2𝐈]1𝐊(𝑩m,𝒃k).absent𝐊subscript𝒃𝑘subscript𝒃𝑘𝐊superscriptsubscript𝑩𝑚subscript𝒃𝑘𝑇superscriptdelimited-[]𝐊subscript𝑩𝑚subscript𝑩𝑚subscriptsuperscript𝜎2italic-ϵ𝐈1𝐊subscript𝑩𝑚subscript𝒃𝑘\displaystyle=\mathbf{K}(\bm{b}_{k},\bm{b}_{k})-\mathbf{K}(\bm{B}_{m},\bm{b}_{% k})^{T}\left[\mathbf{K}(\bm{B}_{m},\bm{B}_{m})+\sigma^{2}_{\epsilon}\mathbf{I}% \right]^{-1}\mathbf{K}(\bm{B}_{m},\bm{b}_{k})~{}.= bold_K ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT [ bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT bold_I ] start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) . (3.3)

Let us define 𝜷=[𝐊(𝑩m,𝑩m)+σϵ2𝐈]1𝒗¯m𝜷superscriptdelimited-[]𝐊subscript𝑩𝑚subscript𝑩𝑚subscriptsuperscript𝜎2italic-ϵ𝐈1subscript¯𝒗𝑚\bm{\beta}=\left[\mathbf{K}(\bm{B}_{m},\bm{B}_{m})+\sigma^{2}_{\epsilon}% \mathbf{I}\right]^{-1}\overline{\bm{v}}_{m}bold_italic_β = [ bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) + italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT bold_I ] start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT such that Eqn. (3.2) is rewritten as the following linear combination of kernel functions k(𝒃i,𝒃k)=[𝐊(𝑩m,𝒃k)]i𝑘subscript𝒃𝑖subscript𝒃𝑘subscriptdelimited-[]𝐊subscript𝑩𝑚subscript𝒃𝑘𝑖k(\bm{b}_{i},\bm{b}_{k})=[\mathbf{K}(\bm{B}_{m},\bm{b}_{k})]_{i}italic_k ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = [ bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ] start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT,

μ(𝒃k)=i=1mβik(𝒃i,𝒃k)=i=1mβiϕ(𝒃i),ϕ(𝒃k),𝜇subscript𝒃𝑘superscriptsubscript𝑖1𝑚subscript𝛽𝑖𝑘subscript𝒃𝑖subscript𝒃𝑘superscriptsubscript𝑖1𝑚subscript𝛽𝑖italic-ϕsubscript𝒃𝑖italic-ϕsubscript𝒃𝑘\mu(\bm{b}_{k})=\sum_{i=1}^{m}\beta_{i}k(\bm{b}_{i},\bm{b}_{k})=\sum_{i=1}^{m}% \beta_{i}\langle\phi(\bm{b}_{i}),\phi(\bm{b}_{k})\rangle~{},italic_μ ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_k ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟨ italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ⟩ , (3.4)

where ϕitalic-ϕ\phiitalic_ϕ is a mapping of the points in the belief space to Hilbert space. This form of the mean GPR predictor shows that it is, in fact, a linear predictor of the convex hull in the Hilbert space. As such, if an arbitrary belief point 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT satisfies ϕ(𝒃k)=i=1mβiϕ(𝒃i)italic-ϕsubscript𝒃𝑘superscriptsubscript𝑖1𝑚subscript𝛽𝑖italic-ϕsubscript𝒃𝑖\phi(\bm{b}_{k})=\sum_{i=1}^{m}\beta_{i}\phi(\bm{b}_{i})italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_β start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), then the mean GPR prediction corresponding to 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT could be inferred directly from the beliefs 𝒃1,,𝒃msubscript𝒃1subscript𝒃𝑚\bm{b}_{1},\ldots,\bm{b}_{m}bold_italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , bold_italic_b start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. However, unless for the cases when the belief space is low dimensional or the belief 𝒃k=𝒃j,jmformulae-sequencesubscript𝒃𝑘subscript𝒃𝑗𝑗𝑚\bm{b}_{k}=\bm{b}_{j},j\leq mbold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = bold_italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_j ≤ italic_m, ϕ(𝒃k)italic-ϕsubscript𝒃𝑘\phi(\bm{b}_{k})italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) will be linearly independent of {ϕ(𝒃i)}i=1msuperscriptsubscriptitalic-ϕsubscript𝒃𝑖𝑖1𝑚\{\phi(\bm{b}_{i})\}_{i=1}^{m}{ italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT. Although strong linear dependency does not hold, Engel et al. (2004) showed that it is possible to have an approximate linear dependency. To test whether a new belief 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT satisfies the approximately linear dependent or ALD criterion, we define

δ=defmin𝒂j=1majϕ(𝒃j)ϕ(𝒃k)2=k(𝒃k,𝒃k)𝐊(𝑩m,𝒃k)𝒂ν,superscriptdef𝛿subscript𝒂superscriptnormsuperscriptsubscript𝑗1𝑚subscript𝑎𝑗bold-italic-ϕsubscript𝒃𝑗bold-italic-ϕsubscript𝒃𝑘2𝑘subscript𝒃𝑘subscript𝒃𝑘𝐊superscriptsubscript𝑩𝑚subscript𝒃𝑘top𝒂𝜈\delta\;\stackrel{{\scriptstyle\text{def}}}{{=}}\;\min_{\boldsymbol{a}}{\Bigg{% \|}\sum_{j=1}^{m}a_{j}\boldsymbol{\phi}(\bm{b}_{j})-\boldsymbol{\phi}(\bm{b}_{% k})\Bigg{\|}}^{2}=\;k(\bm{b}_{k},\bm{b}_{k})-{\mathbf{K}}(\bm{B}_{m},\bm{b}_{k% })^{\top}\boldsymbol{a}\;\leq\;\nu~{},italic_δ start_RELOP SUPERSCRIPTOP start_ARG = end_ARG start_ARG def end_ARG end_RELOP roman_min start_POSTSUBSCRIPT bold_italic_a end_POSTSUBSCRIPT ∥ ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT italic_a start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT bold_italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) - bold_italic_ϕ ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_k ( bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_a ≤ italic_ν , (3.5)

where 𝒂[a1,a2,,am]T=𝐊1(𝑩m,𝑩m)𝐊(𝑩m,𝒃k)𝒂superscriptsubscript𝑎1subscript𝑎2subscript𝑎𝑚𝑇superscript𝐊1subscript𝑩𝑚subscript𝑩𝑚𝐊subscript𝑩𝑚subscript𝒃𝑘\boldsymbol{a}\equiv[a_{1},a_{2},\ldots,a_{m}]^{T}={\mathbf{K}}^{-1}(\bm{B}_{m% },\bm{B}_{m}){\mathbf{K}}(\bm{B}_{m},\bm{b}_{k})bold_italic_a ≡ [ italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = bold_K start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) bold_K ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), and ν𝜈\nuitalic_ν is a threshold that determines the strength of ALD. Here, 𝐊1(𝑩m,𝑩m)superscript𝐊1subscript𝑩𝑚subscript𝑩𝑚\mathbf{K}^{-1}(\bm{B}_{m},\bm{B}_{m})bold_K start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT , bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ) is the inverse of the kernel matrix for the belief set 𝑩msubscript𝑩𝑚\bm{B}_{m}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. If δν𝛿𝜈\delta\leq\nuitalic_δ ≤ italic_ν, then 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is considered ALD on the support belief 𝑩msubscript𝑩𝑚\bm{B}_{m}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. In this case, the support belief set remains unchanged. However, if δ>ν𝛿𝜈\delta>\nuitalic_δ > italic_ν, 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is not linearly dependent on 𝑩msubscript𝑩𝑚\bm{B}_{m}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT, in which case the support belief 𝑩msubscript𝑩𝑚\bm{B}_{m}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is expanded to include the new belief point 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, and the corresponding kernel matrices are updated accordingly.

Note, however, that one needs to repeat the GPR fitting every time a new belief point 𝒃ksubscript𝒃𝑘\bm{b}_{k}bold_italic_b start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is added to the belief set. This is because as new belief points are added, the sawtooth projections 𝒗¯msubscript¯𝒗𝑚\overline{\bm{v}}_{m}over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT of existing beliefs in the training set 𝑩msubscript𝑩𝑚\bm{B}_{m}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT get tighter, leading to a more accurate upper bound approximation of the value function. However, these updated projections are not directly available and must be recomputed.

To reduce computational load, we update the GPR fit using partially revised sawtooth projections, which are then used to obtain the updated mean prediction in Eqn. (3.2), rather than retraining from scratch when upper bound changes are small. Specifically, during initial iterations, when upper bounds change significantly, the GPR model is refitted at each iteration to ensure accuracy. However, as the upper bounds stabilize in later iterations, a single belief point is randomly selected from 𝑩msubscript𝑩𝑚\bm{B}_{m}bold_italic_B start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT in each iteration, and its sawtooth projection is recomputed to update the GPR fit in Eqn. (3.2). Full updates of all sawtooth projections 𝒗¯msubscript¯𝒗𝑚\overline{\bm{v}}_{m}over¯ start_ARG bold_italic_v end_ARG start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT become unnecessary because upper bound improvements tend to stabilize after a few iterations, as also noted by Hauskrecht (2000) and Smith and Simmons (2012).

Refer to caption
Figure 3: Upper bound estimation using GPR and sawtooth projection.

Figure 3 provides a visualization of estimating upper bounds using GPR in the tiger problem example. Figure 3(a) shows the GPR model fit in iteration 1, and Figure 3(b) presents the fitting at iteration 5. At iteration 1, the support belief set includes three belief points with known upper bounds, depicted as orange dots, and three additional points selected based on the ALD criterion, with their respective sawtooth projections shown as green x markers. The GPR model is trained using these six belief points. The resulting convex hull inferred from the GPR model is shown in a solid blue line together with the 1σ1𝜎1\sigma1 italic_σ-confidence bound. For comparison, the solid red line shows the upper bound approximation obtained from the sawtooth projection. As the belief set expands over subsequent iterations, the GPR model is updated with the new belief points using the ALD criterion. At iteration 5, the GPR model is trained by incorporating two additional reachable belief points alongside five existing beliefs with known upper bounds. The updated fitting of the GPR model reflecting the refined upper bound approximation, is depicted in Figure 3(b).

3.1 The Use of Gaussian Process Upper Confidence Bound

The last issue we address in this work is that of GP mean reversal and its impact on the convex hull prediction. The predicted mean tends to collapse toward the prior mean in regions with no or fewer training points. Since the prior mean is assumed to be 0 in the current case, the predicted convex hull would have a tendency to sag toward the zero line. The mean reversal effect is more pronounced with smoother kernel functions such as squared exponential and Matérn, but less so with exponential. An example comparison is shown in Figure 4.

Due to the mean reversal effect, the predicted convex hull may underestimate the true one. Hence, to overcome this limitation, we apply the Upper Confidence Bound (UCB) of the predicted convex hull as the upper bound V¯t(𝒃)subscript¯𝑉𝑡𝒃\overline{V}_{t}(\bm{b})over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ( bold_italic_b ). For any belief point 𝒃𝒃\bm{b}bold_italic_b, the upper confidence bound V¯UCB(𝒃)=μ(𝒃)+ησ(𝒃)subscript¯𝑉𝑈𝐶𝐵𝒃𝜇𝒃𝜂𝜎𝒃\overline{V}_{UCB}(\bm{b})=\mu(\bm{b})+\eta\sigma(\bm{b})over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ) = italic_μ ( bold_italic_b ) + italic_η italic_σ ( bold_italic_b ) provides a conservative estimate of the convex hull h(𝒃)𝒃{h}(\bm{b})italic_h ( bold_italic_b ), where η𝜂\etaitalic_η is a confidence parameter. In the following, we theoretically show that the proposed upper bound V¯UCB()subscript¯𝑉𝑈𝐶𝐵\overline{V}_{UCB}(\cdot)over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( ⋅ ) is a Probably Approximately Correct (PAC) estimate of the convex hull.

Refer to caption
Figure 4: Upper bound estimation using Matérn and exponential kernels in the tiger problem iteration 1.
Theorem 1.

Denoting the convex hull upper bound for belief point 𝐛𝐛\bm{b}bold_italic_b as h(𝐛)𝐛{h}(\bm{b})italic_h ( bold_italic_b ), we have

P(|h(𝒃)μ(𝒃)|ησ(𝒃))1δ for some δ(0,1).𝑃𝒃𝜇𝒃𝜂𝜎𝒃1𝛿 for some 𝛿01P\left(|{h}(\bm{b})-\mu(\bm{b})|\leq\eta\sigma(\bm{b})\right)\geq 1-\delta% \text{ for some }\delta\in(0,1)~{}.italic_P ( | italic_h ( bold_italic_b ) - italic_μ ( bold_italic_b ) | ≤ italic_η italic_σ ( bold_italic_b ) ) ≥ 1 - italic_δ for some italic_δ ∈ ( 0 , 1 ) .

In other words, as more belief points are sampled, σ(𝐛)𝜎𝐛\sigma(\bm{b})italic_σ ( bold_italic_b ) reduces and the mean prediction μ(𝐛)𝜇𝐛\mu(\bm{b})italic_μ ( bold_italic_b ) approaches the true convex hull h(𝐛)𝐛h(\bm{b})italic_h ( bold_italic_b ).

Proof.

Proof. Conditioned on a given set of belief points 𝑩={𝒃i}i=1m𝑩superscriptsubscriptsubscript𝒃𝑖𝑖1𝑚\bm{B}={\{\bm{b}_{i}\}_{i=1}^{m}}bold_italic_B = { bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT and their corresponding sawtooth projections for upper bounds 𝒗¯(𝒃i)¯𝒗subscript𝒃𝑖\overline{\bm{v}}(\bm{b}_{i})over¯ start_ARG bold_italic_v end_ARG ( bold_italic_b start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ), i=1,,m𝑖1𝑚i=1,\dots,mitalic_i = 1 , … , italic_m, the posterior distribution of the estimated upper bound using GPR is v^(𝒃)𝒩(μ(𝒃),σ(𝒃))similar-to^𝑣𝒃𝒩𝜇𝒃𝜎𝒃\widehat{v}(\bm{b})\sim\mathcal{N}(\mu(\bm{b}),\sigma(\bm{b}))over^ start_ARG italic_v end_ARG ( bold_italic_b ) ∼ caligraphic_N ( italic_μ ( bold_italic_b ) , italic_σ ( bold_italic_b ) ), where μ(𝒃)𝜇𝒃\mu(\bm{b})italic_μ ( bold_italic_b ) is the mean prediction and σ(𝒃)𝜎𝒃\sigma(\bm{b})italic_σ ( bold_italic_b )is the standard deviation.

As discussed in Hauskrecht (2000), the sawtooth projection v¯(𝒃)¯𝑣𝒃\bar{v}(\bm{b})over¯ start_ARG italic_v end_ARG ( bold_italic_b ) overestimates the convex hull upper bound h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ), ensuring v¯(𝒃)h(𝒃)¯𝑣𝒃𝒃\bar{v}(\bm{b})\geq h(\bm{b})over¯ start_ARG italic_v end_ARG ( bold_italic_b ) ≥ italic_h ( bold_italic_b ). Since the GPR model is trained on sawtooth projections, the posterior mean μ(𝒃)𝜇𝒃\mu(\bm{b})italic_μ ( bold_italic_b ) progressively approximates h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ) as additional belief points are sampled. Using the properties of Gaussian distribution, the normalized deviation of h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ) from μ(𝒃)𝜇𝒃\mu(\bm{b})italic_μ ( bold_italic_b ), relative to the standard deviation σ(𝒃)𝜎𝒃\sigma(\bm{b})italic_σ ( bold_italic_b ), satisfies

P(|h(𝒃)μ(𝒃)|σ(𝒃)η)12πeη2/2.𝑃𝒃𝜇𝒃𝜎𝒃𝜂12𝜋superscript𝑒superscript𝜂22P\left(\frac{|h(\bm{b})-\mu(\bm{b})|}{\sigma(\bm{b})}\leq\eta\right)\geq\frac{% 1}{\sqrt{2\pi}}e^{-\eta^{2}/2}~{}.italic_P ( divide start_ARG | italic_h ( bold_italic_b ) - italic_μ ( bold_italic_b ) | end_ARG start_ARG italic_σ ( bold_italic_b ) end_ARG ≤ italic_η ) ≥ divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG end_ARG italic_e start_POSTSUPERSCRIPT - italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT .

Rearranging this inequality, we obtain P(|h(𝒃)μ(𝒃)|ησ(𝒃))1δ,𝑃𝒃𝜇𝒃𝜂𝜎𝒃1𝛿P\left(|h(\bm{b})-\mu(\bm{b})|\leq\eta\sigma(\bm{b})\right)\geq 1-\delta~{},italic_P ( | italic_h ( bold_italic_b ) - italic_μ ( bold_italic_b ) | ≤ italic_η italic_σ ( bold_italic_b ) ) ≥ 1 - italic_δ , where δ=112πeη2/2𝛿112𝜋superscript𝑒superscript𝜂22{\delta=1-\frac{1}{\sqrt{2\pi}}e^{-\eta^{2}/2}}italic_δ = 1 - divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG end_ARG italic_e start_POSTSUPERSCRIPT - italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT. ∎∎

Theorem 1 indicates that as more belief points are added, the uncertainty in the convex hull prediction, σ(𝒃)𝜎𝒃\sigma(\bm{b})italic_σ ( bold_italic_b ) decreases, reflecting greater confidence in the GPR prediction. Consequently, the posterior mean μ(𝒃)𝜇𝒃\mu(\bm{b})italic_μ ( bold_italic_b ) converges to the convex hull upper bound h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ).

Corollary 1.

As μ(𝐛)𝜇𝐛\mu(\bm{b})italic_μ ( bold_italic_b ) approaches h(𝐛)𝐛h(\bm{b})italic_h ( bold_italic_b ), the proposed upper confidence bound V¯UCB(𝐛)subscript¯𝑉𝑈𝐶𝐵𝐛\overline{V}_{UCB}(\bm{b})over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ) becomes a conservative estimate of true upper bound, i.e.

P(h(𝒃)V¯UCB(𝒃))12πeη2/2.𝑃𝒃subscript¯𝑉𝑈𝐶𝐵𝒃12𝜋superscript𝑒superscript𝜂22P\left(h(\bm{b})\leq\overline{V}_{UCB}(\bm{b})\right)\geq\frac{1}{\sqrt{2\pi}}% e^{-\eta^{2}/2}~{}.italic_P ( italic_h ( bold_italic_b ) ≤ over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ) ) ≥ divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG end_ARG italic_e start_POSTSUPERSCRIPT - italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT .
Proof.

Proof. Consider the definition of the proposed upper confidence bound V¯UCB(𝒃)subscript¯𝑉𝑈𝐶𝐵𝒃\overline{V}_{UCB}(\bm{b})over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ): V¯UCB(𝒃)=μ(𝒃)+ησ(𝒃).subscript¯𝑉𝑈𝐶𝐵𝒃𝜇𝒃𝜂𝜎𝒃\overline{V}_{UCB}(\bm{b})=\mu(\bm{b})+\eta\sigma(\bm{b}).over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ) = italic_μ ( bold_italic_b ) + italic_η italic_σ ( bold_italic_b ) . Theorem 1 ensures that μ(𝒃)𝜇𝒃\mu(\bm{b})italic_μ ( bold_italic_b ) is close to h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ) with high probability. Therefore, the probability that the upper bound convex hull h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ) lies below the proposed upper confidence bound V¯UCB(𝒃)subscript¯𝑉𝑈𝐶𝐵𝒃\overline{V}_{UCB}(\bm{b})over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ) is P(h(𝒃)V¯UCB(𝒃))=P(h(𝒃)μ(𝒃)ησ(𝒃)).𝑃𝒃subscript¯𝑉𝑈𝐶𝐵𝒃𝑃𝒃𝜇𝒃𝜂𝜎𝒃P\left(h(\bm{b})\leq\overline{V}_{UCB}(\bm{b})\right)=P\left(h(\bm{b})-\mu(\bm% {b})\leq\eta\sigma(\bm{b})\right).italic_P ( italic_h ( bold_italic_b ) ≤ over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ) ) = italic_P ( italic_h ( bold_italic_b ) - italic_μ ( bold_italic_b ) ≤ italic_η italic_σ ( bold_italic_b ) ) . From Theorem 1, we know that P(|h(𝒃)μ(𝒃)|ησ(𝒃))1δ.𝑃𝒃𝜇𝒃𝜂𝜎𝒃1𝛿P\left(|h(\bm{b})-\mu(\bm{b})|\leq\eta\sigma(\bm{b})\right)\geq 1-\delta.italic_P ( | italic_h ( bold_italic_b ) - italic_μ ( bold_italic_b ) | ≤ italic_η italic_σ ( bold_italic_b ) ) ≥ 1 - italic_δ . Substituting δ=112πeη2/2𝛿112𝜋superscript𝑒superscript𝜂22{\delta=1-\frac{1}{\sqrt{2\pi}}e^{-\eta^{2}/2}}italic_δ = 1 - divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG end_ARG italic_e start_POSTSUPERSCRIPT - italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT, we get

P(h(𝒃)V¯UCB(𝒃))12πeη2/2.𝑃𝒃subscript¯𝑉𝑈𝐶𝐵𝒃12𝜋superscript𝑒superscript𝜂22P\left(h(\bm{b})\leq\overline{V}_{UCB}(\bm{b})\right)\geq\frac{1}{\sqrt{2\pi}}% e^{-\eta^{2}/2}~{}.italic_P ( italic_h ( bold_italic_b ) ≤ over¯ start_ARG italic_V end_ARG start_POSTSUBSCRIPT italic_U italic_C italic_B end_POSTSUBSCRIPT ( bold_italic_b ) ) ≥ divide start_ARG 1 end_ARG start_ARG square-root start_ARG 2 italic_π end_ARG end_ARG italic_e start_POSTSUPERSCRIPT - italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT / 2 end_POSTSUPERSCRIPT .

Thus, as μ(𝒃)𝜇𝒃\mu(\bm{b})italic_μ ( bold_italic_b ) converges to h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ) with more belief points sampled and σ(𝒃)𝜎𝒃\sigma(\bm{b})italic_σ ( bold_italic_b ) decreases, the proposed upper confidence bound becomes a probabilistically conservative estimate of the true upper bound h(𝒃)𝒃h(\bm{b})italic_h ( bold_italic_b ). ∎∎

Refer to caption
Figure 5: Estimation of upper bounds with the GPR model in the tiger problem.

Figure 5 shows how the GPR model estimates the upper bounds for the tiger problem. Starting from (0.5,0.5)0.50.5(0.5,0.5)( 0.5 , 0.5 ), the belief points reachable by t+2𝑡2t+2italic_t + 2 are depicted as circles, with sampled beliefs shown on the red routes. The upper bounds of these sampled beliefs are updated using backward recursion. At t+2𝑡2t+2italic_t + 2, the GPR model is constructed using the most informative belief points in the belief set (shown with orange-shaded circles). In this example, one of the belief points at t+2𝑡2t+2italic_t + 2 (shown with a red edge but not orange-shaded) is not selected to train the GPR due to its lack of information gain. In addition, note that the belief point (0.5,0.5)0.50.5(0.5,0.5)( 0.5 , 0.5 ), which is not a belief point originally sampled, is selected to train the GPR model based on the ALD criterion, indicating it is an informative belief point. The trained GPR model is then used to estimate the upper bounds of other reachable belief points at t+2𝑡2t+2italic_t + 2 using the GP-UCB. Finally, we estimate upper bounds for belief points at t+1𝑡1t+1italic_t + 1 using the backward recursion given by Eqn. (C.3) in Appendix B.

Algorithm 2 outlines the steps for approximating an upper bound to the value function using our proposed GP-UCB approach.

Algorithm 2 GP-UCB Algorithm
1:Input: Planning horizon T𝑇Titalic_T, initial belief 𝒃0subscript𝒃0\bm{b}_{0}bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, threshold ϵitalic-ϵ\epsilonitalic_ϵ
2:Output: V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\overline{V_{0}}(\bm{b}_{0})over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ), V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\underline{V_{0}}(\bm{b}_{0})under¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ), and 𝚐𝚊𝚙𝚐𝚊𝚙\tt{gap}typewriter_gap for initial belief 𝒃0subscript𝒃0\bm{b}_{0}bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT
3:Initialization: For t{0,1,,T1}𝑡01𝑇1t\in\{0,1,\dots,T-1\}italic_t ∈ { 0 , 1 , … , italic_T - 1 }, initialize 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT to include all corner points and 𝒃0subscript𝒃0\bm{b}_{0}bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, initialize training set 𝑩tsubscript𝑩𝑡\bm{B}_{t}bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT to include all points in 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT, and initialize ΓtsubscriptΓ𝑡\Gamma_{t}roman_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT using Eqn. (2.6).
4:while running time <<< time limit and 𝚐𝚊𝚙>ϵ𝚐𝚊𝚙italic-ϵ\tt{gap}>\epsilontypewriter_gap > italic_ϵ do
5:    Expand 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT for t{0,1,,T1}𝑡01𝑇1t\in\{0,1,\dots,T-1\}italic_t ∈ { 0 , 1 , … , italic_T - 1 } using an sampling method, e.g., max-gap (lines 6-13 in Algo. 1), random (lines 14-17 in Algo. 1), fixed-grid (lines 18-20 in Algo. 1)
6:    for t=T1𝑡𝑇1t=T-1italic_t = italic_T - 1 to 00 do
7:         Update ΓtsubscriptΓ𝑡\Gamma_{t}roman_Γ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT for 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT using Eqn. (2.6)
8:         if first iteration then
9:             Initialize GPR model for t𝑡titalic_t with beliefs in 𝑩tsubscript𝑩𝑡\bm{B}_{t}bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
10:         else
11:             if initial iterations or 𝚐𝚊𝚙𝚐𝚊𝚙\tt{gap}typewriter_gap change>100ϵabsent100italic-ϵ>100\cdot\epsilon> 100 ⋅ italic_ϵ or periodic check iteration then
12:                 Update all upper bounds for beliefs in 𝑩tsubscript𝑩𝑡\bm{B}_{t}bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT using sawtooth projection
13:                 Fit the GPR model with updated upper bounds of the beliefs in 𝑩tsubscript𝑩𝑡\bm{B}_{t}bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
14:             else
15:                 Randomly select a belief 𝒃𝒃\bm{b}bold_italic_b in 𝑩tsubscript𝑩𝑡\bm{B}_{t}bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
16:                 Compute the updated sawtooth projection for 𝒃𝒃\bm{b}bold_italic_b
17:                 Update the GPR fit using revised sawtooth projections
18:             end if
19:         end if
20:         if 𝓉subscript𝓉\cal{B}_{t}caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT expanded with new points = True then
21:             Predict the covariance for new reachable beliefs using the GPR model
22:             if Covariance \geq ALD threshold then
23:                 Expand 𝑩tsubscript𝑩𝑡\bm{B}_{t}bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT with the new reachable belief
24:                 Update GPR model using the expanded beliefs in 𝑩tsubscript𝑩𝑡\bm{B}_{t}bold_italic_B start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
25:             end if
26:         end if
27:         Update Vt¯(𝒃)¯subscript𝑉𝑡𝒃\overline{V_{t}}(\bm{b})over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) for 𝒃𝓉for-all𝒃subscript𝓉\forall\bm{b}\in\cal{B}_{t}∀ bold_italic_b ∈ caligraphic_B start_POSTSUBSCRIPT caligraphic_t end_POSTSUBSCRIPT using Eqn. (C.3) with GPR model predictions
28:    end for
29:    Compute V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\overline{V_{0}}(\bm{b}_{0})over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) and V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\underline{V_{0}}(\bm{b}_{0})under¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) for belief 𝒃0subscript𝒃0\bm{b}_{0}bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and update 𝚐𝚊𝚙=V0¯(𝒃0)V0¯(𝒃0)𝚐𝚊𝚙¯subscript𝑉0subscript𝒃0¯subscript𝑉0subscript𝒃0{\tt gap}=\overline{V_{0}}(\bm{b}_{0})-\underline{V_{0}}(\bm{b}_{0})typewriter_gap = over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) - under¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT )
30:end while
31:return V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\overline{V_{0}}(\bm{b}_{0})over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ), V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\underline{V_{0}}(\bm{b}_{0})under¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ), and 𝚐𝚊𝚙𝚐𝚊𝚙\tt{gap}typewriter_gap for the 𝒃0subscript𝒃0\bm{b}_{0}bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT

4 Numerical Experiments

In this section, we present the experimental evaluation of our approach using five commonly used test problems from pomdp.org (Cassandra 2025). These examples are selected to compare the performance of the algorithm on a range of problem sized, including varying numbers of states, actions, observations, and horizon length.

Table 1 summarizes the environmental variables for the five test problems used in our experiments. Since we focus on the finite planning horizon, we consider undiscounted versions of the problems (i.e., discount factors are set to 1111). Additionally, we set the initial belief points for all examples as the center point, i.e., (1/|𝒮|,1/|𝒮|,,1/|𝒮|)1𝒮1𝒮1𝒮(1/|{\cal S}|,1/|{\cal S}|,...,1/|{\cal S}|)( 1 / | caligraphic_S | , 1 / | caligraphic_S | , … , 1 / | caligraphic_S | ), which indicates that all states are equally likely at t=0𝑡0t=0italic_t = 0. To ensure robustness and account for variability due to the random selection of belief point in updating the GPR model (lines 15-17 in Algorithm 2) or random sampling, each experiment is repeated 10 times. All experiments were implemented in Python and solved on a Dell Desktop XPS 8940 with an Intel Core i5-11400 Processor and 40.0 GB RAM to ensure a fair comparison.

Table 1: Problem size variables for the test problems
|𝒮|𝒮|\cal S|| caligraphic_S | |𝒜|𝒜|\cal A|| caligraphic_A | |𝒪|𝒪|\cal O|| caligraphic_O | T𝑇Titalic_T
ChengD51 5 3 3 10, 15, 20, 40
Network 7 4 2 10, 15, 20, 40
Query 27 3 3 10, 15, 20, 40
Hallway 60 5 21 10, 15, 20, 40
Aloha.30 90 29 3 10, 15, 20, 40

Experiments are terminated based on two stopping criteria: either exceeding 3000 seconds or reaching a target gap of gasubscript𝑔𝑎g_{a}italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT. The target gap gasubscript𝑔𝑎g_{a}italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT is dynamically determined after each iteration based on the magnitude of the upper bounds of the value function for the belief point 𝒃0subscript𝒃0\bm{b}_{0}bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT at stage t=0𝑡0t=0italic_t = 0. Specifically, gasubscript𝑔𝑎g_{a}italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT is computed as ga=Λ(V0¯(𝒃0))10ρ,subscript𝑔𝑎Λ¯subscript𝑉0subscript𝒃0superscript10𝜌g_{a}=\frac{\Lambda\left(\overline{V_{0}}(\bm{b}_{0})\right)}{10^{\rho}}~{},italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT = divide start_ARG roman_Λ ( over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) end_ARG start_ARG 10 start_POSTSUPERSCRIPT italic_ρ end_POSTSUPERSCRIPT end_ARG , where V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\overline{V_{0}}(\bm{b}_{0})over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) is the upper bound of the value function for the initial belief point 𝒃0subscript𝒃0\bm{b}_{0}bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT at stage t=0𝑡0t=0italic_t = 0 after the latest iteration, and Λ(V0¯(𝒃0))Λ¯subscript𝑉0subscript𝒃0\Lambda(\overline{V_{0}}(\bm{b}_{0}))roman_Λ ( over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ) rounds V0¯(𝒃0)¯subscript𝑉0subscript𝒃0\overline{V_{0}}(\bm{b}_{0})over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) up to the nearest power of 10. For example, if V0¯(𝒃0)=281¯subscript𝑉0subscript𝒃0281\overline{V_{0}}(\bm{b}_{0})=281over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = 281, then Λ(281)=1000Λ2811000\Lambda(281)=1000roman_Λ ( 281 ) = 1000, and if V0¯(𝒃0)=64¯subscript𝑉0subscript𝒃064\overline{V_{0}}(\bm{b}_{0})=64over¯ start_ARG italic_V start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_ARG ( bold_italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) = 64, then Λ(64)=100Λ64100\Lambda(64)=100roman_Λ ( 64 ) = 100. This way, the precision parameter ρ𝜌\rhoitalic_ρ directly controls the stringency of gasubscript𝑔𝑎g_{a}italic_g start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT. Larger values of ρ𝜌\rhoitalic_ρ yield smaller gaps, resulting in more precise approximations. We set the precision level, ρ𝜌\rhoitalic_ρ, as 5555 in our experiments. We set the uncertainty level, η𝜂\etaitalic_η, of the upper confidence bound to 1111, and the accuracy level, ν𝜈\nuitalic_ν, of the ALD condition to 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT. We set the initial iteration phase in line 11 of Algorithm 2 at 5 iterations and perform periodic checks every 5 iterations in our experiments.

To compare and contrast the performance of the different algorithms, we focus on two primary metrics: (i) the lower bound of the value function for the initial belief point 𝒃𝟎subscript𝒃0\bm{b_{0}}bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT at stage t=0𝑡0t=0italic_t = 0, and (ii) the gap between the upper and lower bounds for 𝒃𝟎subscript𝒃0\bm{b_{0}}bold_italic_b start_POSTSUBSCRIPT bold_0 end_POSTSUBSCRIPT. For experiments involving random selection processes, such as the random selection of belief points in GPR model updates in the GP-UCB approach, both metrics of lower bound and gap are averaged over 10 runs to account for variability and to ensure reliable comparisons.

In our study, we evaluate the performance of our GP-UCB approach in comparison to the sawtooth projection method under three different belief point sampling strategies: (1) max-gap sampling based on the work from Walraven and Spaan (2019), (2) random sampling proposed by Spaan and Vlassis (2005), and (3) fixed-grid sampling from Lovejoy (1991a). These strategies are chosen to highlight different aspects of belief point selection, with max-gap focusing on regions of high uncertainty in the value function, random sampling providing a baseline of unbiased coverage, and fixed-grid offering a structured exploration of the belief space. Combining each sampling strategy with the two approximation methods results in a total of six experimental settings, enabling a comprehensive analysis of their relative performance under varying sampling approaches. Among these six settings, fixed-grid sampling is fully deterministic for both the GP-UCB and sawtooth approaches, while max-gap sampling is also deterministic when used with the sawtooth approach.

5 Performance Results

A key advantage of the GP-UCB approach is its ability to significantly reduce the computational time required for sawtooth projections. Figure 6 compares the growth in the number of sawtooth executions over iterations for the max-gap-sawtooth and max-gap-GPUCB approaches, for all test problems. The results indicate that although both methods show an increase in sawtooth executions as iterations progress, the growth is substantially slower for max-gap-GPUCB. Using GP-UCB to estimate the upper bounds for belief points minimizes the need for frequent and computationally expensive sawtooth calculations. This efficiency allows the GP-UCB approach to achieve smaller gaps within the same computational time, or reach the target gap significantly faster than max-gap-sawtooth.

Refer to caption
Figure 6: Comparison of the number of sawtooth executions over iterations between algorithms with pure sawtooth and with GP-UCB.

Table 2 presents the experimental results for lower bounds and gaps at h=4040h=40italic_h = 40, based on the predefined stopping criteria. Full results for all problems under all tested planning horizons (i.e., 10, 15, 20, and 40 periods) are provided in Tables D.1 to D.5. Bolded numbers indicate the largest lower bounds (LB) and smallest gaps (Gap), highlighting the best-performing methods for these metrics. The values shown in parentheses represent the standard deviations reflecting the variability observed across multiple runs, where applicable. In addition to reporting the average gap, we also report on the worst-case gap observed across the ten runs to provide additional information on the robustness and consistency of the methods under varying conditions.

Upon comparing the results of GP-UCB and pure sawtooth projection under the max-gap belief expansion strategy, it is evident that GP-UCB consistently achieves superior or comparable performance. Specifically, in terms of the lower bound, GP-UCB reaches the same or higher values as the sawtooth method in all cases by the termination of the experiment runs. Looking at the gaps, GP-UCB outperforms the sawtooth method in all examples, with the exception of the network problem at h=1010h=10italic_h = 10 (full results are provided in Appendix D). In this particular case, GP-UCB requires only 6666 additional seconds to reach the target gap, demonstrating no significant disadvantage. For the remaining cases, GP-UCB either reaches the target gap faster or achieves a smaller gap when terminated at the 3000-second limit. Even when evaluating the worst-case gaps, the conclusions remain consistent with the observations for the average gaps across ten runs, further highlighting the efficiency of GP-UCB.

When evaluating the performance of sawtooth and GP-UCB under the random sampling strategy, we again observe consistent trends. For all test problems, GP-UCB achieves larger lower bounds and smaller gaps compared to the sawtooth method. This highlights the advantage of GP-UCB in improving the value function approximation even when belief points are sampled without a specific focus on regions of high uncertainty.

Finally, when comparing the GP-UCB and sawtooth projection methods under the fixed-grid belief sampling strategy, we observe that the fixed-grid strategy is unable to solve large-scale problems such as query, hallway, and aloha.30. This limitation arises due to the exponential growth in the number of grid points, leading to excessive computational and memory requirements. For the remaining two smaller problems, ChengD51 and network, the GP-UCB method consistently achieves the same lower bounds as the sawtooth method. However, GP-UCB demonstrates its superiority by achieving smaller gaps in all cases.

The above results demonstrate that the choice of belief expansion strategy significantly impacts performance. Among the three strategies evaluated (i.e.,max-gap, random sampling, and fixed-grid sampling), fixed-grid strategy consistently unperformed and was unable to solve large-scale problems such as query, hallway, and aloha.30 due to its poor scalability to high-dimensional belief spaces. Max-gap sampling generally achieves the smallest gaps, emphasizing its effectiveness in reducing uncertainty in the belief space. However, for large-scale problems like aloha.30, random sampling achieves larger lower bounds than max-gap sampling, as it favors exploration of the belief space.

Table 2: Experiment results for the five problems at h=4040h=40italic_h = 40. Bolded numbers indicate the largest upper bounds and smallest gaps, values in parentheses are standard deviations.
Max-gap GP-UCB Max-gap sawtooth Random GP-UCB Random sawtooth Fixed-grid GP-UCB Fixed-grid sawtooth
ChengD51 LB 261.713(0.000) 261.713 261.713(0.000) 261.713(0.000) 261.711 261.711
Gap 0.129(0.002) 0.170 2.087(0.058) 2.287(0.092) 73.452 75.449
Time (s) 3022.0(9.3) 3028.8 3069.6(26.9) 3019.3(22.2) 2.3 0.5
\cdashline2-8 Worst Gap 0.131 0.170 2.170 2.447 73.452 75.449
Network LB 592.274(0.003) 592.274 591.878(0.150) 591.719(0.229) 582.704 582.704
Gap 1.453(0.018) 2.419 42.844(1.844) 50.222(1.503) 543.012 660.788
Time (s) 3030.6(36.2) 3020.2 3015.0(15.7) 3020.2(13.1) 3.2 1.5
\cdashline2-8 Worst Gap 1.488 2.419 45.825 53.214 543.012 660.788
Query LB 120.327(0.000) 120.327 120.327(0.000) 120.327(0.000) NA NA
Gap 0.017(0.001) 0.026 0.010(0.001) 0.037(0.000) NA NA
Time (s) 3004.4(5.6) 3049.5 3018.8(32.9) 3027.4(30.6) NA NA
\cdashline2-8 Worst Gap 0.018 0.026 0.010 0.037 NA NA
Hallway LB 1.930(0.000) 1.912 1.612(0.067) 1.295(0.061) NA NA
Gap 1.220(0.000) 1.234 1.451(0.070) 1.510(0.061) NA NA
Time (s) 2871.8(18.9) 3030.7 3092.7(66.6) 3069.2(21.0) NA NA
\cdashline2-8 Worst Gap 1.220 1.234 1.571 1.638 NA NA
Aloha.30 LB 282.129(0.000) 282.039 283.338(0.249) 282.329(0.577) NA NA
Gap 10.428(0.003) 12.507 11.164(0.632) 12.480(0.932) NA NA
Time (s) 2956.0(27.2) 3086.8 3084.9(45.8) 3163.7(39.4) NA NA
\cdashline2-8 Worst Gap 10.435 12.507 12.113 14.570 NA NA

To further evaluate the efficiency of GP-UCB compared to the sawtooth projection method, we analyze the time required for GP-UCB to achieve the same (or smaller) gap as sawtooth at its terminating iteration under the max-gap belief expansion strategy. The results, summarized in Table 3, demonstrate significant time savings in most problems and planning horizons. For smaller problems like ChengD51 and network, GP-UCB achieves the sawtooth gaps in 30%-60% less time. For larger problems such as query, hallway, and aloha.30, the time savings are even more significant, reaching up to 99.7% for query and over 70% for aloha.30 at longer horizon lengths. While GP-UCB generally provides greater time savings at longer horizons, some variations exist across problems, such as Hallway. These differences likely stem from problem-specific structures and how efficiently GP-UCB generalizes across horizons. To capture these trends and exceptions, we present results for all planning horizons in Table 3.

Table 3: Time for GP-UCB to reach the same or smaller gap as sawtooth achieved at the terminating iteration for different problems using max-gap method for belief expansion
Problem Planning Horizon Sawtooth GP-UCB Time Reduction (%)
Gap Time (s) Gap Time (s)
ChengD51 h=10 0.005 3022.4 0.005 1860.5 38.4%
h=15 0.021 3010.9 0.021 1873.5 37.8%
h=20 0.045 3032.9 0.044 1685.1 44.4%
h=40 0.170 3028.8 0.167 1909.0 37.0%
Network h=10 0.010 52.9 0.010 58.8 -11.2%
h=15 0.039 3011.3 0.038 2060.9 31.6%
h=20 0.346 3029.9 0.342 1183.4 60.9%
h=40 2.419 3020.2 2.340 1369.1 54.7%
Query h=10 0.001 80.7 0.001 5.7 92.9%
h=15 0.002 3000.8 0.001 8.6 99.7%
h=20 0.004 3025.6 0.003 11.1 99.6%
h=40 0.026 3049.5 0.026 33.5 98.9%
Hallway h=10 0.168 3034.5 0.166 2759.3 9.1%
h=15 0.360 3082.8 0.358 1458.9 52.7%
h=20 0.524 2883.4 0.524 2337.9 18.9%
h=40 1.234 3030.7 1.222 2922.8 3.6%
Aloha.30 h=10 0.736 2962.7 0.735 2728.3 7.9%
h=15 2.004 2912.5 1.978 776.9 73.3%
h=20 4.003 3070.2 3.952 937.7 69.5%
h=40 12.507 3086.8 12.419 740.2 76.0%

To provide a visual representation of the experiment results, we plot the lower bounds and upper bounds of the value function during the execution of the max-gap-GP-UCB and max-gap-sawtooth methods at h=4040h=40italic_h = 40 in Figure 7. Full results are shown in Figure D.1. Consistent with the findings in Tables 2, the GP-UCB approach demonstrates strong performance in all test problems, achieving more favorable lower and upper bounds in most cases. For problems such as query and aloha.30, GP-UCB achieves significantly smaller upper bounds from the very beginning of the experiment runs, demonstrating its efficiency in approximating the value function in these cases. In contrast, for problems like the hallway problem, the reduction in upper bounds is less significant, indicating that certain problem structures may inherently limit the performance of GP-UCB.

One limitation of our approach is the need to select the kernel function. In these experiments, we used an exponential kernel for the GP-UCB approach. While this choice works well for most problems, larger problems like hallway and aloha.30 may benefit from alternative kernels such as the Matérn kernel, which can better generalize across larger belief spaces and avoid overfitting to the small initial data sets.

Refer to caption
Figure 7: Upper and lower bounds of the value function for the five examples at h=4040h=40italic_h = 40. Dots represent upper and lower bounds from FiVI, while crosses indicate those from AUCB, color-coded as shown in the legend.

6 Conclusions

In this paper, we present GP-UCB, a novel approach to accelerate the upper bound estimation in point-based value iteration to solve finite-horizon POMDPs. By utilizing Gaussian Process Regression to approximate the upper bound convex hull across the belief space, our method reduces computational complexity without sacrificing solution quality. The approach selectively trains on the most informative subset of belief points, significantly enhancing solution efficiency and scalability. This innovation avoids redundant upper bound updates and provides theoretical guarantees for the convergence of the proposed GP-UCB to the true upper bound convex hull. Through extensive experiments on benchmark problems, we demonstrated that GP-UCB consistently accelerates convergence and improves scalability, reducing computation time by 30%-60% on smaller problems and up to 99.7% on larger ones while maintaining the same upper bound gaps as the pure sawtooth projection method. These results underscore its potential as a powerful tool for solving practical large-scale POMDPs in finite-horizon settings.

In the proposed approach, the GPR model uses training data from sawtooth projections, which are computationally efficient for estimating upper bounds. However, a trade-off exists between computation time and data quality. Generating higher-quality data results in better approximations for upper bounds, but may take longer. Future research will focus on finding a way to balance this trade-off to improve GPR model performance. Another promising direction is optimizing the selection of kernels for GPR, as the choice of kernel directly impacts the quality of the approximation and computational efficiency. Investigating adaptive kernel selection strategies or using domain-specific kernels could enhance the flexibility and accuracy of the GPR model.

References

  • Ayer et al. (2012) Ayer T, Alagoz O, Stout NK (2012) OR forum—a POMDP approach to personalize mammography screening decisions. Operations Research 60(5):1019–1034.
  • Bai et al. (2012) Bai H, Hsu D, Kochenderfer MJ, Lee WS (2012) Unmanned aircraft collision avoidance using continuous-state POMDPs. Robotics: Science and Systems VII 1:1–8.
  • Cassandra (1998) Cassandra AR (1998) A survey of POMDP applications. Working notes of AAAI 1998 fall symposium on planning with partially observable Markov decision processes, volume 1724.
  • Cassandra (2025) Cassandra AR (2025) Partially Observable Markov Decision Processes (POMDPs). http://www.pomdp.org, accessed: 04-Feb-2025.
  • Cassandra et al. (1997) Cassandra AR, Littman ML, Zhang NL (1997) Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI) 54–61.
  • Cheng (1988) Cheng Hh (1988) Algorithms for partially observable markov decision processes. Proceedings of the 24th IEEE Conference on Decision and Control, 1209–1211 (IEEE).
  • Deep et al. (2023) Deep A, Zhou S, Veeramani D, Chen Y (2023) Partially observable markov decision process-based optimal maintenance planning with time-dependent observations. European Journal of Operational Research 311(2):533–544.
  • Engel et al. (2004) Engel Y, Mannor S, Meir R (2004) The kernel recursive least-squares algorithm. IEEE Transactions on signal processing 52(8):2275–2285.
  • Gong and Liu (2023) Gong J, Liu S (2023) Partially observable collaborative model for optimizing personalized treatment selection. European Journal of Operational Research 309(3):1409–1419.
  • Hajjar and Alagoz (2023) Hajjar A, Alagoz O (2023) Personalized disease screening decisions considering a chronic condition. Management Science 69(1):260–282.
  • Hauskrecht (2000) Hauskrecht M (2000) Value-function approximations for partially observable Markov decision processes. Journal of artificial intelligence research 13:33–94.
  • Islam et al. (2024) Islam UJ, Paynabar K, Runger G, Iquebal AS (2024) Dynamic exploration–exploitation trade-off in active learning regression with bayesian hierarchical modeling. IISE Transactions 1–15.
  • Kaelbling et al. (1998) Kaelbling LP, Littman ML, Cassandra AR (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101(1-2):99–134.
  • Kim et al. (2018) Kim JW, Choi GB, Lee JM (2018) A POMDP framework for integrated scheduling of infrastructure maintenance and inspection. Computers & Chemical Engineering 112:239–252.
  • Kurniawati et al. (2008) Kurniawati H, Hsu D, Lee WS (2008) SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. Robotics: Science and systems, volume 2008 (Zurich, Switzerland).
  • Li et al. (2023) Li W, Denton BT, Morgan TM (2023) Optimizing active surveillance for prostate cancer using Partially Observable Markov Decision Processes. European Journal of Operational Research 305(1):386–399.
  • Lin et al. (2004) Lin ZZ, Bean JC, White CC (2004) A hybrid genetic/optimization algorithm for finite-horizon, partially observed Markov decision processes. INFORMS Journal on Computing 16(1):27–38.
  • Littman (1996) Littman ML (1996) Algorithms for sequential decision-making (Brown University).
  • Littman (2009) Littman ML (2009) A tutorial on partially observable Markov decision processes. Journal of Mathematical Psychology 53(3):119–125.
  • Littman et al. (1995) Littman ML, Cassandra AR, Kaelbling LP (1995) Learning policies for partially observable environments: Scaling up. Proceedings of the Twelfth International Conference on Machine Learning (ICML), 362–370.
  • Liu et al. (2022) Liu Z, Khojandi A, Li X, Mohammed A, Davis RL, Kamaleswaran R (2022) A machine learning–enabled partially observable markov decision process framework for early sepsis prediction. INFORMS Journal on Computing 34(4):2039–2057.
  • Lovejoy (1991a) Lovejoy WS (1991a) Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1):162–175.
  • Lovejoy (1991b) Lovejoy WS (1991b) A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research 28(1):47–65.
  • Madani et al. (1999) Madani O, Hanks S, Condon A (1999) On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. Aaai/iaai 10(315149.315395).
  • Monahan (1982) Monahan G (1982) A survey of partially observable markov decision processes: theory, models, and algorithms. Management Science 28(1):1–16.
  • Mueller and Kochenderfer (2016) Mueller ER, Kochenderfer M (2016) Multi-rotor aircraft collision avoidance using partially observable Markov decision processes. AIAA Modeling and Simulation Technologies Conference, 3673.
  • Papadimitriou and Tsitsiklis (1987) Papadimitriou CH, Tsitsiklis JN (1987) The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441–450.
  • Papakonstantinou and Shinozuka (2014) Papakonstantinou KG, Shinozuka M (2014) Planning structural inspection and maintenance policies via dynamic programming and Markov processes. part ii: POMDP implementation. Reliability Engineering & System Safety 130:214–224.
  • Pineau et al. (2003) Pineau J, Gordon G, Thrun S (2003) Point-based value iteration: An anytime algorithm for POMDPs. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1025–1030.
  • Poupart et al. (2011) Poupart P, Kim KE, Kim D (2011) Closing the gap: Improved bounds on optimal POMDP solutions. Proceedings of the International Conference on Automated Planning and Scheduling, volume 21.
  • Shani et al. (2013) Shani G, Pineau J, Kaplow R (2013) A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems 27(1):1–51.
  • Smallwood and Sondik (1973) Smallwood RD, Sondik EJ (1973) The optimal control of partially observable markov processes over a finite horizon. Operations Research 21(5):1071–1088.
  • Smith and Simmons (2004) Smith T, Simmons R (2004) Heuristic search value iteration for pomdps. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI), 520–527 (AUAI Press).
  • Smith and Simmons (2005) Smith T, Simmons R (2005) Point-based POMDP algorithms: Improved analysis and implementation. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI) 542–549.
  • Smith and Simmons (2012) Smith T, Simmons R (2012) Point-based POMDP algorithms: Improved analysis and implementation. arXiv preprint arXiv:1207.1412 .
  • Sondik (1971) Sondik EJ (1971) The optimal control of Partially Observable Markov Processes (Stanford University).
  • Sondik (1978) Sondik EJ (1978) The optimal control of Partially Observable Markov Processes over the infinite horizon: Discounted costs. Operations Research 26(2):282–304.
  • Song et al. (2022) Song C, Zhang C, Shafieezadeh A, Xiao R (2022) Value of information analysis in non-stationary stochastic decision environments: A reliability-assisted POMDP approach. Reliability Engineering & System Safety 217:108034.
  • Spaan and Vlassis (2005) Spaan MT, Vlassis N (2005) Perseus: Randomized point-based value iteration for POMDPs. Journal of artificial intelligence research 24:195–220.
  • Temizer et al. (2010) Temizer S, Kochenderfer M, Kaelbling L, Lozano-Pérez T, Kuchar J (2010) Collision avoidance for unmanned aircraft using Markov decision processes. AIAA Guidance Navigation and Control Conference, 8040.
  • Vozikis and Goulionis (2009) Vozikis A, Goulionis JE (2009) Medical decision making for patients with parkinson disease under average cost criterion. Australia and New Zealand Health Policy 6(1).
  • Walraven and Spaan (2019) Walraven E, Spaan MT (2019) Point-based value iteration for finite-horizon POMDPs. Journal of Artificial Intelligence Research 65:307–341.
  • Zhang and Liu (1996) Zhang NL, Liu W (1996) Planning in stochastic domains: Problem characteristics and approximation. Technical report, Technical Report HKUST-CS96-31, Department of Computer Science, Hong Kong University of Science and Technology.
  • Zhang and Wang (2022) Zhang W, Wang H (2022) Diagnostic policies optimization for chronic diseases based on POMDP model. Healthcare, volume 10:2, 283 (MDPI).
  • Zois (2016) Zois DS (2016) Sequential decision-making in healthcare iot: Real-time health monitoring, treatments and interventions. 2016 IEEE 3rd World Forum on Internet of Things (WF-IoT), 24–29 (IEEE).

Appendix A Background on POMDP Problems and Solution Algorithms

Like MDPs, POMDPs identify control actions based on the current state of the system, but since the system state is only partially observable, decisions are made based on a belief state—a probability distribution over the possible system states that represents the current estimate of the system’s true state. This belief state is updated through systemic observations that provide probabilistic information about the system state at any given time. In addition to the usual MDP assumptions regarding action- and state-dependent transition probabilities, POMDP formulations assume a likelihood function, which gives the probability of observing a particular outcome for each system state. This allows for the computation of the next belief state as a function of the current belief state, the action taken, and the observation obtained (Littman 2009). Since the belief state is a continuous vector over the probability distribution of the system states, the belief space becomes continuous, making the solution of POMDPs much more difficult (Papadimitriou and Tsitsiklis 1987, Kaelbling et al. 1998, Madani et al. 1999).

The structural properties of the value function in POMDPs, for both finite- and infinite-horizon problems, are well understood. In both cases, the value function, which maps a belief state to the expected total reward, is piecewise-linear and convex in the belief space (Sondik 1971, Smallwood and Sondik 1973). This means that the value function can be represented as the maximum of a set of linear functions over the belief space, each associated with a specific action. These linear functions are known as alpha-vectors, where each alpha-vector corresponds to a specific action and encodes the expected future rewards for that action given the current belief state (Sondik 1971). For any belief state, the value is determined by selecting the alpha-vector that maximizes the expected reward. In the finite-horizon case, the set of alpha-vectors is updated stage by stage using backward induction to compute the value function at each time step (Smallwood and Sondik 1973). In infinite-horizon problems, the value function is stationary, and the alpha-vectors are iteratively refined until a fixed point is reached (Sondik 1978, Monahan 1982, Lovejoy 1991b). However, the number of belief states and corresponding alpha-vectors grows exponentially with the size of the state and observation spaces, leading to the well-known “curse of dimensionality.” This exponential growth makes the computation of exact solutions infeasible for large-scale problems, even when pruning techniques are employed to reduce the number of alpha-vectors (Sondik 1971, Monahan 1982, Cheng 1988, Littman 1996, Zhang and Liu 1996, Cassandra et al. 1997).

A.1 Point-Based Value Iteration

After the structural properties of POMDP value functions were established, and the challenges of solving them exactly for large problems became evident, researchers developed various approximate methods. Among the most prominent of these is the Point-Based Value Iteration (PBVI) algorithm, introduced by Pineau et al. (2003). PBVI was a major breakthrough because it significantly improved the scalability of POMDP solvers by focusing on a subset of belief points, rather than attempting to compute the value function over the entire continuous belief space.

The key insight behind PBVI is that not all belief states are equally important for decision making. Instead of considering the entire belief space, PBVI selects a representative set of belief points and performs computations on these points. By concentrating computational effort on a smaller set of critical belief states, PBVI drastically reduces the computational load while still providing a good approximation of the value function. This approach is particularly effective, as PBVI uses the structure of POMDPs where a limited set of belief points - especially those likely to be encountered under the optimal policy - can be sufficient to approximate the value function well enough for effective decision making (Pineau et al. 2003, Smith and Simmons 2005). This makes PBVI one of the most widely used algorithms to solve POMDPs, particularly for large and complex problems (Shani et al. 2013).

The main components of PBVI algorithms can be summarized as follows:

  1. 1.

    Select Initial Belief Points: A small set of representative belief points is initialized, typically using random sampling or simulations.

  2. 2.

    Backup Operation: The lower and upper bounds on the value function at each belief point are updated by considering the immediate reward and expected future rewards based on potential actions and observations.

  3. 3.

    Expand Belief Set: New belief points are added as needed to improve the approximation across the belief space.

  4. 4.

    Convergence: The process repeats until the difference between the upper and lower bounds falls below a set threshold, indicating convergence.

Since its introduction in 2003, various improvements have been made to each of the four key components of PBVI. For belief point selection, techniques such as heuristic sampling (Pineau et al. 2003) and more structured approaches such as reachability analysis (Kurniawati et al. 2008) have improved the initial set of belief points by focusing on those most relevant to the policy. The backup operation has seen optimizations with randomized belief backups, as in the Perseus algorithm (Spaan and Vlassis 2005), which selectively updates belief points to reduce computational cost. In terms of belief set expansion, advanced methods such as adaptive belief selection (Shani et al. 2013) and forward search techniques have improved the algorithm’s ability to refine the belief set efficiently.

For convergence, several approaches have been employed to accelerate the process. One such strategy is incremental pruning (Cassandra et al. 1997), which reduces the complexity of managing alpha-vectors by eliminating irrelevant vectors during each iteration. Another is heuristic search, which focuses computational resources on the most promising regions of the belief space, improving both convergence speed and efficiency (Smith and Simmons 2004). Additionally, bounded policy updates as implemented in Heuristic Search Value Iteration (HSVI) (Smith and Simmons 2005) ensure that the value function and policy are refined efficiently while maintaining convergence guarantees.

A.2 Upper and Lower Bounds

One of the most significant innovations for convergence and belief point selection is the introduction of upper and lower bounds on the value function. These bounds serve multiple purposes: they guide the selection of critical belief points by focusing on regions where the difference between the upper and lower bounds is largest, and they provide stopping criteria for the algorithm when the bounds converge sufficiently (Smith and Simmons 2005, Poupart et al. 2011). By integrating bounds into PBVI, convergence is not only accelerated, but the algorithm also gains the ability to focus more intelligently on belief points where improvement is needed most.

In PBVI algorithms, calculating the lower bounds is relatively straightforward and computationally inexpensive. For each belief point, the best alpha-vector (the one that yields the highest expected reward) is chosen, ensuring that the lower bound represents a conservative estimate of the value function. This method provides a reliable, minimal reward estimate without adding significant computational overhead. However, calculating upper bounds has traditionally been more complex and costly. Exact methods for the upper bound calculation involve projecting the belief point onto the convex hull of the belief/upper bound pairs. The tightest upper bound is obtained by minimizing the projection using linear programming methods. Although this provides the strongest possible upper bound, it is computationally expensive (Monahan 1982, Cheng 1988). The QMDP approach, which uses Q𝑄Qitalic_Q-values (i.e., the expected reward of taking an action in a state and following an optimal policy), simplifies the problem by assuming full observability after a single step, leading to an optimistic upper bound (Littman et al. 1995).

For such cases, approximate methods like those used in HSVI and Successive Approximations of the Reachable Space under Optimal Policies (SARSOP) algorithms offer more computational efficiency by deriving upper bounds from optimistic assumptions. In HSVI, the upper bound is updated by a heuristic search, exploring actions and observations optimistically, guiding the algorithm toward promising belief points (Smith and Simmons 2005). SARSOP focuses on reachable belief spaces to update upper bounds only for belief points likely to be encountered under an optimal policy, further reducing computational costs (Kurniawati et al. 2008). These heuristic-based methods are particularly useful in infinite-horizon POMDPs, where the value function is stationary, meaning upper bounds do not need frequent recalculation.

In finite-horizon POMDPs, the situation becomes more complex because the value function changes dynamically over time (Walraven and Spaan 2019). The upper bound must be recalculated at each time step, reflecting the evolving decision horizon. Calculating an exact upper bound for finite-horizon problems involves exploring all possible future outcomes optimistically, considering the best-case scenario at each stage, which leads to a high computational cost (Walraven and Spaan 2019, Smith and Simmons 2005). This process requires evaluating future rewards under all potential actions and observations for each belief point, further increasing the complexity.

Although PBVI has been effective for infinite-horizon problems, where a stationary policy optimizes the value function indefinitely, significant adjustments are necessary to apply it to finite-horizon problems. In infinite-horizon settings, the value function remains stationary, allowing algorithms such as PBVI to refine the value function iteratively without accounting for the remaining time steps (Pineau et al. 2003). However, in finite-horizon problems, the value function changes since the number of remaining decisions affects both immediate and future rewards. As the horizon shortens, the value function increasingly reflects immediate rather than future rewards. This requires the algorithm to adapt at each step to account for the evolving optimal policy (Walraven and Spaan 2019). Efficient mechanisms, particularly for accurate upper and lower bounds, are crucial to ensuring that belief point selection and value function updates remain computationally feasible.

To address these challenges, Walraven and Spaan (2019) implemented efficient methods for handling upper and lower bounds in finite-horizon problems. By estimating the maximum possible value for each belief point and focusing on the widest gap between the upper and lower bounds, the algorithm prioritizes regions where the value function is the least accurate. This accelerates convergence and reduces computation time, while upper bounds serve as a stopping criterion, terminating the algorithm when convergence is reached, ensuring a near-optimal policy without excessive refinement.

A.3 Sawtooth Projection to Obtain Upper Bounds

Instead of using exact upper bounds, Hauskrecht (2000) introduced the so-called sawtooth projections to dynamically adjust the upper bounds at each step of the horizon. This approximation leverages the convex structure of the value function to reduce computational complexity while maintaining a reasonable level of approximation accuracy. Poupart et al. (2011) reported that sawtooth projections can significantly reduce computational costs while preserving a high degree of solution quality. Specifically, in the worst-case scenario, the computational complexity for the approximation decreases to O(|||𝒮|)𝑂𝒮O(|\cal B||\cal S|)italic_O ( | caligraphic_B | | caligraphic_S | ), where \cal{B}caligraphic_B is a sampled belief set in the |𝒮|𝒮|\cal S|| caligraphic_S |-dimensional space with ||>|𝒮|𝒮|\cal B|>|\cal S|| caligraphic_B | > | caligraphic_S |.

The effectiveness of sawtooth projection in finite-horizon settings was recently demonstrated by Walraven and Spaan (2019) in their Finite-horizon Value Iteration (FiVI) algorithm. FiVI outperforms both the original PBVI (Pineau et al. 2003) and the Heuristic Search Value Iteration (HSVI) (Smith and Simmons 2005) in terms of speed and scalability, especially in large finite-horizon problems. By focusing computational effort where it is most needed, the sawtooth projection allows FiVI to achieve faster convergence without sacrificing solution quality. Furthermore, their numerical results showed that FiVI scales better with both the horizon length and the complexity of belief spaces. This makes it a more versatile option for large-scale applications, and less sensitive to the size of state and action spaces (Walraven and Spaan 2019).

The primary computational complexity of the sawtooth projection stems from the need to repeatedly update the upper bound each time a new belief point is added to the belief set. As this set expands, the upper bound approximation of the value function improves, narrowing the gap with the lower bound. In the initial stages of belief expansion, adding new points significantly improves the upper bound. However, as more belief points are added, the marginal improvement becomes minimal, even though the number of updates —and thus the computational complexity—increases (Hauskrecht 2000). This leads to upper bound updates that require computations that scale with the product of the cardinality of the belief set and the dimensionality of the belief state, resulting in a computational complexity of O(|||𝒮|)𝑂𝒮O(|\cal B||\cal S|)italic_O ( | caligraphic_B | | caligraphic_S | ).

Hence, while sawtooth projection offers a computationally efficient alternative to computing the convex hull of upper bounds, its effectiveness remains limited, as the number of sawtooth projection operations grows rapidly with the expanding belief set. As a result, this approach is still insufficient to significantly reduce complexity and solve larger problems.

In this work, we model the convex hull of the upper bounds by fitting a subset of the belief points and their sawtooth projections using GPR. As such, for any new belief point, the upper bound interpolation is directly inferred from the posterior of the GPR predictions, precluding the need to solve any linear programs or sawtooth projection. Furthermore, at every iteration of the upper bound approximation, we fit the GPR only to a subset of the belief/upper bound pairs that are most informative, to reduce the computational cost of covariance matrix inversion.

Notably, multiple belief points lying on the same alpha-vector are redundant and do not provide additional information about the value function. To avoid such redundant belief points while training the GPR, we argue that the most informative belief points are those that are approximately linearly independent in a high-dimensional Hilbert space (Islam et al. 2024, Engel et al. 2004). Therefore, each time a new belief point is added, we select a subset of belief points that satisfy the approximate linear dependence (ALD) criterion to train the GPR. This selected subset is usually significantly smaller than the entire belief set, which further reduces computational demands.

Finally, it must be noted that the proposed GP-UCB is only approximately correct, although with a very high probability. Hence, we provide theoretical results on the consistency of the proposed upper bound approximations as the belief set expands.

Appendix B Description of the tiger problem

To demonstrate the concept of reachable belief points in a POMDP, we use the tiger problem, originally introduced by Kaelbling et al. (1998). In this problem, an agent faces two doors, with a tiger behind one and treasure behind the other, i.e., the unobservable states are “Tiger Left” and “Tiger Right.” At every stage, the agent can open one of the doors (i.e., actions “open left” or “open right”) or listen (action “listen”) for a tiger growl to gather (noisy) information. In particular, there is an 85% chance of correctly estimating the location of the tiger after listening. The agent’s actions are associated with specific rewards and costs: opening the correct door yields a reward of +1010+10+ 10, while opening the door with the tiger results in a penalty of 100100-100- 100, and listening incurs a cost of 11-1- 1. After the agent opens a door, the state resets: the location of the tiger is randomized, placing it behind one of the doors with equal probability. Observations are then made in the new state, where both left and right actions lead to either observation (Growl from Left or Growl from Right) with a probability of 0.5, regardless of the actual location of the tiger. In the finite-horizon version, at the terminating stage, only the immediate reward or penalty from the final action is considered, without further actions or resets.

Appendix C Further details on sawtooth projection method

Hauskrecht (2000) proposed an interpolation approach for efficient approximation of the convex hull, referred to as sawtooth projection, to reduce computational complexity. Given an arbitrary belief 𝒃𝒃\bm{b}\in\cal{B}bold_italic_b ∈ caligraphic_B and corner beliefs {𝒘1,𝒘2,,𝒘|𝒮|}subscript𝒘1subscript𝒘2subscript𝒘𝒮\{\bm{w}_{1},\bm{w}_{2},...,\bm{w}_{|\cal{S}|}\}{ bold_italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , bold_italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , bold_italic_w start_POSTSUBSCRIPT | caligraphic_S | end_POSTSUBSCRIPT }, the sawtooth projection provides the upper bound interpolation for any new belief point 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT. The corner beliefs are defined as 𝒘s=(b(1),b(2),b(|𝒮|)\bm{w}_{s}=(b(1),b(2),...b(|\cal{S}|)bold_italic_w start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = ( italic_b ( 1 ) , italic_b ( 2 ) , … italic_b ( | caligraphic_S | ), where b(s)=1𝑏𝑠1b(s)=1italic_b ( italic_s ) = 1 and b(s)=0𝑏superscript𝑠0b(s^{\prime})=0italic_b ( italic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) = 0 for all sssuperscript𝑠𝑠s^{\prime}\neq sitalic_s start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ≠ italic_s. The best upper bound approximation for a belief 𝒃𝒃\bm{b}bold_italic_b is the one that minimizes λ(𝒃)f(𝒃)𝜆𝒃𝑓𝒃\lambda(\bm{b})f(\bm{b})italic_λ ( bold_italic_b ) italic_f ( bold_italic_b ), where

λ(𝒃)𝜆𝒃\displaystyle\lambda(\bm{b})italic_λ ( bold_italic_b ) =\displaystyle== min{b(s)/b(s)|s𝒮}, andconditionalsuperscript𝑏𝑠𝑏𝑠𝑠𝒮 and\displaystyle\min\{b^{\prime}(s)/b(s)|s\in\cal S\}~{},\text{ and }roman_min { italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s ) / italic_b ( italic_s ) | italic_s ∈ caligraphic_S } , and (C.1)
f(𝒃)𝑓𝒃\displaystyle f(\bm{b})italic_f ( bold_italic_b ) =\displaystyle== V¯(𝒃)s𝒮b(s)V¯(𝒘s),¯𝑉𝒃subscript𝑠𝒮𝑏𝑠¯𝑉subscript𝒘𝑠\displaystyle\overline{V}(\bm{b})-\sum_{s\in\cal{S}}b(s)\overline{V}(\bm{w}_{s% })~{},over¯ start_ARG italic_V end_ARG ( bold_italic_b ) - ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) over¯ start_ARG italic_V end_ARG ( bold_italic_w start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) , (C.2)

Intuitively, this method interprets the value function as a set of downward-pointing pyramids, with their bases at the corners of the belief space and their tips at known belief point upper bounds.

Following the sawtooth projection for an arbitrary belief point 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT at time stage t𝑡titalic_t, the upper bound for belief point, 𝒃𝒃\bm{b}bold_italic_b at stage t𝑡titalic_t is obtained as

Vt¯(𝒃)=maxa𝒜[s𝒮b(s)R(s,a)+o𝒪P(o|𝒃,a)v¯t+1(𝒃)],¯subscript𝑉𝑡𝒃𝑎𝒜delimited-[]subscript𝑠𝒮𝑏𝑠𝑅𝑠𝑎subscript𝑜𝒪𝑃conditional𝑜𝒃𝑎subscript¯𝑣𝑡1superscript𝒃bold-′\overline{V_{t}}(\bm{b})=\underset{a\in{\cal A}}{\max}\left[\sum_{s\in{\cal S}% }b(s)R(s,a)+\sum_{o\in{\cal O}}P(o|\bm{b},a)\bar{v}_{t+1}(\bm{b^{\prime}})% \right]~{},over¯ start_ARG italic_V start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG ( bold_italic_b ) = start_UNDERACCENT italic_a ∈ caligraphic_A end_UNDERACCENT start_ARG roman_max end_ARG [ ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_S end_POSTSUBSCRIPT italic_b ( italic_s ) italic_R ( italic_s , italic_a ) + ∑ start_POSTSUBSCRIPT italic_o ∈ caligraphic_O end_POSTSUBSCRIPT italic_P ( italic_o | bold_italic_b , italic_a ) over¯ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT ) ] , (C.3)

where v¯t+1(𝒃)subscript¯𝑣𝑡1superscript𝒃\bar{v}_{t+1}(\bm{b}^{\prime})over¯ start_ARG italic_v end_ARG start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ( bold_italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) is the upper bound projection obtained using sawtooth projection for the belief state 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT resulting from taking action a𝑎aitalic_a and observing o𝑜oitalic_o. Details of the upper bound projection using the sawtooth projection is presented in Algorithm C.1. Intuitively, the upper bound update is the maximum expected value of the sum of immediate reward and the upper bound of the future reward.

Algorithm C.1 Upperbound approximation with sawtooth projection method
1:Input: POMDP model, belief point 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT, belief-bound pairs (𝒃,V¯(𝒃))𝒃¯𝑉𝒃(\bm{b},\overline{V}(\bm{b}))( bold_italic_b , over¯ start_ARG italic_V end_ARG ( bold_italic_b ) ) for each belief 𝒃𝒃\bm{b}bold_italic_b in the belief set \cal{B}caligraphic_B
2:Output: upper bound corresponding to belief point 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT
3:for 𝒃{𝔀𝓈|𝓈𝒮}𝒃conditional-setsubscript𝔀𝓈𝓈𝒮\bm{b}\in\cal{B}\setminus\{\bm{w}_{s}|s\in S\}bold_italic_b ∈ caligraphic_B ∖ { bold_caligraphic_w start_POSTSUBSCRIPT caligraphic_s end_POSTSUBSCRIPT | caligraphic_s ∈ caligraphic_S } do
4:    f(𝒃)V¯(𝒃)sSb(s)V¯(𝒘s)𝑓𝒃¯𝑉𝒃subscript𝑠𝑆𝑏𝑠¯𝑉subscript𝒘𝑠f(\bm{b})\leftarrow\overline{V}(\bm{b})-\sum_{s\in S}b(s)\overline{V}(\bm{w}_{% s})italic_f ( bold_italic_b ) ← over¯ start_ARG italic_V end_ARG ( bold_italic_b ) - ∑ start_POSTSUBSCRIPT italic_s ∈ italic_S end_POSTSUBSCRIPT italic_b ( italic_s ) over¯ start_ARG italic_V end_ARG ( bold_italic_w start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT )
5:    λ(𝒃)minsb(s)b(s)𝜆𝒃subscript𝑠superscript𝑏𝑠𝑏𝑠\lambda(\bm{b})\leftarrow\min_{s}\frac{b^{\prime}(s)}{b(s)}italic_λ ( bold_italic_b ) ← roman_min start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT divide start_ARG italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s ) end_ARG start_ARG italic_b ( italic_s ) end_ARG
6:end for
7:𝒃argminbλ(𝒃)f(𝒃)superscript𝒃subscript𝑏𝜆𝒃𝑓𝒃\bm{b}^{*}\leftarrow\arg\min_{b}\lambda(\bm{b})f(\bm{b})bold_italic_b start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ← roman_arg roman_min start_POSTSUBSCRIPT italic_b end_POSTSUBSCRIPT italic_λ ( bold_italic_b ) italic_f ( bold_italic_b )
8:return λ(𝒃)f(𝒃)+sSb(s)V¯(𝒘s)𝜆superscript𝒃𝑓superscript𝒃subscript𝑠𝑆superscript𝑏𝑠¯𝑉subscript𝒘𝑠\lambda(\bm{b}^{*})f(\bm{b}^{*})+\sum_{s\in S}b^{\prime}(s)\overline{V}(\bm{w}% _{s})italic_λ ( bold_italic_b start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) italic_f ( bold_italic_b start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) + ∑ start_POSTSUBSCRIPT italic_s ∈ italic_S end_POSTSUBSCRIPT italic_b start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_s ) over¯ start_ARG italic_V end_ARG ( bold_italic_w start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT )

Figure C.1 illustrates the sawtooth projection method for a two-state problem. The orange dots labeled 1 to 5 represent belief points with known upper bounds, with dots 1 and 5 being corner beliefs. By connecting belief points 2, 3, and 4 to the corner beliefs, three downward pointing triangles are formed. For a new belief point 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT, the sawtooth method estimates its upper bound by projecting 𝒃superscript𝒃bold-′\bm{b^{\prime}}bold_italic_b start_POSTSUPERSCRIPT bold_′ end_POSTSUPERSCRIPT onto the connecting lines and choosing the projection with the lowest value, which in this case is the line formed by dot 3 and corner belief 1. The estimated upper bound is represented by the red dot on the left of Figure C.1. The sawtooth projection for the entire belief space is shown as the red solid line in the right figure. For comparison, the optimal value function is plotted as the purple line in Figure C.1.

Refer to caption
Figure C.1: The approximation of upper bound with sawtooth projection.

Appendix D Full computational results

Table D.1: Experiment results for the ChengD51 problem. The target gap is 0.0010.0010.0010.001. Bolded numbers indicate the largest upper bounds and smallest gaps, values in parentheses are standard deviations.
Max-gap GP-UCB Max-gap sawtooth Random GP-UCB Random sawtooth Fixed-grid GP-UCB Fixed-grid sawtooth
h=10 LB 65.246(0.000) 65.246 65.246(0.000) 65.246(0.000) 65.246 65.246
Gap 0.004(0.000) 0.005 0.231(0.026) 0.263(0.020) 16.897 18.896
Time (s) 3008.2(7.0) 3022.4 3013.8(8.2) 3017.4(6.9) 0.4 0.1
\cdashline2-8 Worst Gap 0.004 0.005 0.278 0.308 16.897 18.896
h=15 LB 97.990(0.000) 97.990 97.990(0.000) 97.990(0.000) 97.990 97.990
Gap 0.017(0.000) 0.021 0.488(0.014) 0.562(0.049) 26.322 28.322
Time (s) 3020.0(17.6) 3010.9 3014.7(10.1) 3018.2(9.4) 0.7 0.2
\cdashline2-8 Worst Gap 0.017 0.021 0.502 0.650 26.322 28.322
h=20 LB 130.735(0.000) 130.735 130.735(0.000) 130.735(0.000) 130.734 130.734
Gap 0.034(0.001) 0.045 0.758(0.051) 0.850(0.039) 35.748 37.747
Time (s) 3019.9(21.1) 3032.9 3047.9(17.8) 3033.9(8.5) 1.1 0.2
\cdashline2-8 Worst Gap 0.035 0.045 0.842 0.913 35.748 37.747
h=40 LB 261.713(0.000) 261.713 261.713(0.000) 261.713(0.000) 261.711 261.711
Gap 0.129(0.002) 0.170 2.087(0.058) 2.287(0.092) 73.452 75.449
Time (s) 3022.0(9.3) 3028.8 3069.6(26.9) 3019.3(22.2) 2.3 0.5
\cdashline2-8 Worst Gap 0.131 0.170 2.170 2.447 73.452 75.449
Table D.2: Experiment results for the network problem. The target gap is 0.010.010.010.01. Bolded numbers indicate the largest upper bounds and smallest gaps, values in parentheses are standard deviations.
Max-gap GP-UCB Max-gap sawtooth Random GP-UCB Random sawtooth Fixed-grid GP-UCB Fixed-grid sawtooth
h=10 LB 151.180(0.000) 151.180 151.180(0.000) 151.180(0.000) 150.456 150.456
Gap 0.010(0.000) 0.010 1.348(0.232) 1.846(0.357) 104.631 141.734
Time (s) 58.8(7.2) 52.9 3014.2(11.9) 3012.2(7.0) 0.7 0.3
\cdashline2-8 Worst Gap 0.010 0.010 1.983 2.365 104.631 141.734
h=15 LB 224.616(0.000) 224.616 224.608(0.009) 224.576(0.079) 221.643 221.643
Gap 0.025(0.002) 0.039 5.888(0.542) 8.076(0.830) 176.103 229.092
Time (s) 3011.7(13.5) 3011.3 3018.5(15.1) 3012.9(10.4) 1.2 0.5
\cdashline2-8 Worst Gap 0.031 0.039 6.535 9.748 176.103 229.092
h=20 LB 298.149(0.000) 298.149 298.101(0.032) 298.053(0.084) 293.685 293.685
Gap 0.164(0.024) 0.346 12.579(1.042) 15.545(0.995) 248.946 315.601
Time (s) 3033.7(20.6) 3029.9 3029.0(21.6) 3011.9(11.1) 1.7 0.7
\cdashline2-8 Worst Gap 0.216 0.346 13.854 17.795 248.946 315.601
h=40 LB 592.274(0.003) 592.274 591.878(0.150) 591.719(0.229) 582.704 582.704
Gap 1.453(0.018) 2.419 42.844(1.844) 50.222(1.503) 543.012 660.788
Time (s) 3030.6(36.2) 3020.2 3015.0(15.7) 3020.2(13.1) 3.2 1.5
\cdashline2-8 Worst Gap 1.488 2.419 45.825 53.214 543.012 660.788
Table D.3: Experiment results for the query problem. The target gap is 0.0010.0010.0010.001. Bolded numbers indicate the largest upper bounds and smallest gaps, values in parentheses are standard deviations. The fixed-grid method is unable to solve the problem within the limited 3000 seconds.
Max-gap GP-UCB Max-gap sawtooth Random GP-UCB Random sawtooth Fixed-grid GP-UCB Fixed-grid sawtooth
h=10 LB 30.021(0.000) 30.021 30.021(0.000) 30.021(0.000) NA NA
Gap 0.001(0.000) 0.001 0.001(0.000) 0.001(0.000) NA NA
Time (s) 5.7(0.6) 80.7 7.3(2.5) 3030.8(7.9) NA NA
\cdashline2-8 Worst Gap 0.001 0.001 0.001 0.001 NA NA
h=15 LB 45.047(0.000) 45.047 45.047(0.000) 45.047(0.000) NA NA
Gap 0.001(0.000) 0.002 0.001(0.000) 0.004(0.000) NA NA
Time (s) 8.6(0.6) 3000.8 30.6(8.4) 3036.2(13.9) NA NA
\cdashline2-8 Worst Gap 0.001 0.002 0.001 0.004 NA NA
h=20 LB 60.083(0.000) 60.083 60.083(0.000) 60.083(0.000) NA NA
Gap 0.001(0.000) 0.004 0.001(0.000) 0.007(0.000) NA NA
Time (s) 3007.6(4.8) 3025.6 206.2(72.9) 3039.5(21.0) NA NA
\cdashline2-8 Worst Gap 0.001 0.004 0.001 0.007 NA NA
h=40 LB 120.327(0.000) 120.327 120.327(0.000) 120.327(0.000) NA NA
Gap 0.017(0.001) 0.026 0.010(0.001) 0.037(0.000) NA NA
Time (s) 3004.4(5.6) 3049.5 3018.8(32.9) 3027.4(30.6) NA NA
\cdashline2-8 Worst Gap 0.018 0.026 0.010 0.037 NA NA
Table D.4: Experiment results for the hallway problem. The target gap is 0.000010.000010.000010.00001. Bolded numbers indicate the largest upper bounds and smallest gaps, values in parentheses are standard deviations. The fixed-grid method is unable to solve the problem within the limited 3000 seconds.
Max-gap GP-UCB Max-gap sawtooth Random GP-UCB Random sawtooth Fixed-grid GP-UCB Fixed-grid sawtooth
h=10 LB 0.311(0.004) 0.311 0.327(0.002) 0.326(0.003) NA NA
Gap 0.162(0.000) 0.168 0.188(0.005) 0.180(0.009) NA NA
Time (s) 3067.0(27.9) 3034.5 3054.3(36.8) 3065.6(47.5) NA NA
\cdashline2-8 Worst Gap 0.162 0.168 0.196 0.196 NA NA
h=15 LB 0.604(0.001) 0.587 0.591(0.012) 0.587(0.011) NA NA
Gap 0.334(0.001) 0.360 0.384(0.013) 0.387(0.010) NA NA
Time (s) 2989.8(24.5) 3082.8 3086.2(57.0) 3062.6(33.9) NA NA
\cdashline2-8 Worst Gap 0.336 0.360 0.396 0.391 NA NA
h=20 LB 0.865(0.001) 0.858 0.811(0.024) 0.796(0.023) NA NA
Gap 0.507(0.001) 0.524 0.604(0.030) 0.615(0.026) NA NA
Time (s) 3022.0(25.1) 2883.4 3010.1(37.3) 3063.4(41.6) NA NA
\cdashline2-8 Worst Gap 0.508 0.524 0.646 0.635 NA NA
h=40 LB 1.930(0.000) 1.912 1.612(0.067) 1.295(0.061) NA NA
Gap 1.220(0.000) 1.234 1.451(0.070) 1.510(0.061) NA NA
Time (s) 2871.8(18.9) 3030.7 3092.7(66.6) 3069.2(21.0) NA NA
\cdashline2-8 Worst Gap 1.220 1.234 1.571 1.638 NA NA
Table D.5: Experiment results for the aloha30 problem. The target gap is 0.010.010.010.01. Bolded numbers indicate the largest upper bounds and smallest gaps, values in parentheses are standard deviations. The fixed-grid method is unable to solve the problem within the limited 3000 seconds.
Max-gap GP-UCB Max-gap sawtooth Random GP-UCB Random sawtooth Fixed-grid GP-UCB Fixed-grid sawtooth
h=10 LB 122.527(0.000) 122.501 122.545(0.006) 122.517(0.016) NA NA
Gap 0.735(0.000) 0.736 0.877(0.072) 0.963(0.121) NA NA
Time (s) 3042.3(6.2) 2962.7 3035.4(21.3) 3004.7(22.7) NA NA
\cdashline2-8 Worst Gap 0.735 0.736 0.979 1.177 NA NA
h=15 LB 167.681(0.000) 167.591 167.784(0.017) 167.625(0.137) NA NA
Gap 1.836(0.000) 2.004 2.026(0.132) 2.460(0.229) NA NA
Time (s) 3050.9(21.2) 2912.5 3034.1(29.2) 3101.9(77.6) NA NA
\cdashline2-8 Worst Gap 1.836 2.004 2.221 3.017 NA NA
h=20 LB 203.827(0.004) 203.657 204.090(0.045) 203.765(0.203) NA NA
Gap 3.248(0.007) 4.003 3.661(0.095) 4.417(0.317) NA NA
Time (s) 3061.5(12.4) 3070.2 3073.9(32.9) 3134.4(33.9) NA NA
\cdashline2-8 Worst Gap 3.262 4.003 3.804 5.134 NA NA
h=40 LB 282.129(0.000) 282.039 283.338(0.249) 282.329(0.577) NA NA
Gap 10.428(0.003) 12.507 11.164(0.632) 12.480(0.932) NA NA
Time (s) 2956.0(27.2) 3086.8 3084.9(45.8) 3163.7(39.4) NA NA
\cdashline2-8 Worst Gap 10.435 12.507 12.113 14.570 NA NA
Refer to caption
Figure D.1: Upper and lower bounds of the value function for the five examples. Dots represent upper and lower bounds from FiVI, while crosses indicate those from AUCB, color-coded as shown in the legend.