
High confidence off-policy evaluation

Published: 25 January 2015

Abstract

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.
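To ground the setup described above, here is a minimal Python sketch of the basic idea behind importance-sampling off-policy evaluation with a lower confidence bound: weight each observed trajectory's return by the ratio of action probabilities under the evaluation and behavior policies, then apply a concentration inequality to the weighted returns. This uses a generic Hoeffding-style bound with a hand-chosen cap on the weighted returns, which is looser than the bound developed in the paper; the trajectory format, the pi_e/pi_b callables, and the toy data are illustrative assumptions, not the authors' implementation.

import numpy as np

def importance_weighted_returns(trajectories, pi_e, pi_b):
    # trajectories: list of trajectories, each a list of (state, action, reward).
    # pi_e(a, s), pi_b(a, s): action probabilities under the evaluation and
    # behavior policies (hypothetical callables; any stochastic policy works).
    weighted = []
    for traj in trajectories:
        rho = 1.0   # cumulative importance weight for this trajectory
        ret = 0.0   # undiscounted return of this trajectory
        for s, a, r in traj:
            rho *= pi_e(a, s) / pi_b(a, s)
            ret += r
        weighted.append(rho * ret)
    return np.asarray(weighted)

def hoeffding_lower_bound(x, upper, delta=0.05):
    # 1 - delta lower confidence bound on E[X] for i.i.d. X in [0, upper].
    n = len(x)
    return x.mean() - upper * np.sqrt(np.log(1.0 / delta) / (2.0 * n))

# Toy illustration (hypothetical data): single state, two actions, reward = action.
rng = np.random.default_rng(0)
pi_b = lambda a, s: 0.5                       # behavior policy: uniform over 2 actions
pi_e = lambda a, s: 0.8 if a == 1 else 0.2    # evaluation policy to be scored
trajectories = [[(0, a, float(a))] for a in rng.integers(0, 2, size=1000)]

# Cap the weighted returns at c so the range needed by the bound is known.
# With nonnegative rewards, capping can only lower the mean, so the result
# remains a valid (if conservative) lower bound on the true expected return.
c = 2.0
x = np.minimum(importance_weighted_returns(trajectories, pi_e, pi_b), c)
print(hoeffding_lower_bound(x, upper=c, delta=0.05))  # true value under pi_e is 0.8

The paper's contribution is a tighter, practical bound of this general kind; the sketch only illustrates why importance-weighted returns plus a concentration inequality yield a high-confidence guarantee without executing the new policy.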



Published In

AAAI'15: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence
January 2015, 4331 pages
ISBN: 0262511290
Sponsor: Association for the Advancement of Artificial Intelligence
Publisher: AAAI Press
