
High confidence off-policy evaluation

Published: 25 January 2015

Abstract

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
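To ground the setup described above, here is a minimal Python sketch of the basic idea behind importance-sampling off-policy evaluation with a lower confidence bound: weight each observed trajectory's return by the ratio of action probabilities under the evaluation and behavior policies, then apply a concentration inequality to the weighted returns. This uses a generic Hoeffding-style bound with a hand-chosen cap on the weighted returns, which is looser than the bound developed in the paper; the trajectory format, the pi_e/pi_b callables, and the toy data are illustrative assumptions, not the authors' implementation.

import numpy as np

def importance_weighted_returns(trajectories, pi_e, pi_b):
    # trajectories: list of trajectories, each a list of (state, action, reward).
    # pi_e(a, s), pi_b(a, s): action probabilities under the evaluation and
    # behavior policies (hypothetical callables; any stochastic policy works).
    weighted = []
    for traj in trajectories:
        rho = 1.0   # cumulative importance weight for this trajectory
        ret = 0.0   # undiscounted return of this trajectory
        for s, a, r in traj:
            rho *= pi_e(a, s) / pi_b(a, s)
            ret += r
        weighted.append(rho * ret)
    return np.asarray(weighted)

def hoeffding_lower_bound(x, upper, delta=0.05):
    # 1 - delta lower confidence bound on E[X] for i.i.d. X in [0, upper].
    n = len(x)
    return x.mean() - upper * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Toy illustration (hypothetical data): single state, two actions, reward = action.
rng = np.random.default_rng(0)
pi_b = lambda a, s: 0.5                       # behavior policy: uniform over 2 actions
pi_e = lambda a, s: 0.8 if a == 1 else 0.2    # evaluation policy to be scored
trajectories = [[(0, a, float(a))] for a in rng.integers(0, 2, size=1000)]

# Cap the weighted returns at c so the range needed by the bound is known.
# With nonnegative rewards, capping can only lower the mean, so the result
# remains a valid (if conservative) lower bound on the true expected return.
c = 2.0
x = np.minimum(importance_weighted_returns(trajectories, pi_e, pi_b), c)
print(hoeffding_lower_bound(x, upper=c, delta=0.05))  # true value under pi_e is 0.8

The paper's contribution is a tighter, practical bound of this general kind; the sketch only illustrates why importance-weighted returns plus a concentration inequality yield a high-confidence guarantee without executing the new policy.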



Published In

AAAI'15: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence
January 2015, 4331 pages
ISBN: 0262511290
Sponsor: Association for the Advancement of Artificial Intelligence
Publisher: AAAI Press
