
A perspective on off-policy evaluation in reinforcement learning

  • Perspective
Frontiers of Computer Science


Author information


Corresponding author

Correspondence to Lihong Li.

Additional information

Lihong Li is a research scientist at Google Brain, USA. Previously, he held research positions at Yahoo! Research (Silicon Valley) and Microsoft Research (Redmond). His main research interests are in reinforcement learning, including contextual bandits, and other related problems in AI. His work has found applications in recommendation, advertising, Web search and conversation systems, and has won best paper awards at ICML, AISTATS and WSDM. He serves as area chair or senior program committee member at major AI/ML conferences such as AAAI, ICLR, ICML, IJCAI and NIPS/NeurIPS.


About this article


Cite this article

Li, L. A perspective on off-policy evaluation in reinforcement learning. Front. Comput. Sci. 13, 911–912 (2019). https://doi.org/10.1007/s11704-019-9901-7



  • DOI: https://doi.org/10.1007/s11704-019-9901-7