DOI: 10.5555/3495724.3497304

Off-policy policy evaluation for sequential decisions under unobserved confounding

Published: 06 December 2020

Abstract

When observed decisions depend only on observed features, off-policy policy evaluation (OPE) methods for sequential decision problems can estimate the performance of evaluation policies before deploying them. However, this assumption is frequently violated due to unobserved confounders, unrecorded variables that impact both the decisions and their outcomes. We assess robustness of OPE methods under unobserved confounding by developing worst-case bounds on the performance of an evaluation policy. When unobserved confounders can affect every decision in an episode, we demonstrate that even small amounts of per-decision confounding can heavily bias OPE methods. Fortunately, in a number of important settings found in healthcare, policy-making, and technology, unobserved confounders may directly affect only one of the many decisions made, and influence future decisions/rewards only through the directly affected decision. Under this less pessimistic model of one-decision confounding, we propose an efficient loss-minimization-based procedure for computing worst-case bounds, and prove its statistical consistency. On simulated healthcare examples—management of sepsis and interventions for autistic children—where this is a reasonable model, we demonstrate that our method invalidates non-robust results and provides meaningful certificates of robustness, allowing reliable selection of policies under unobserved confounding.
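To make the one-decision confounding setting concrete, the sketch below computes a crude interval for an importance-sampling OPE estimate when the logged behavior propensity at a single decision step may be misspecified by a multiplicative factor Gamma (a Rosenbaum-style sensitivity parameter), while every other step is assumed unconfounded. This is an illustration only, not the paper's method: the function name, trajectory format, and per-episode interval construction are hypothetical, and the paper's actual procedure is a loss-minimization that exploits further constraints to obtain consistent, and generally tighter, worst-case bounds.

import numpy as np

def is_value_bounds(trajectories, pi_e, gamma_conf, confounded_step):
    """Crude worst-case bounds on an importance-sampling OPE estimate when the
    behavior propensity at one decision step is only known up to a factor
    gamma_conf >= 1 (hypothetical illustration of a sensitivity analysis).

    trajectories: list of episodes; each episode is a list of
        (state, action, reward, behavior_prob) tuples.
    pi_e: function (state, action) -> evaluation-policy probability.
    """
    lower, upper = [], []
    for episode in trajectories:
        ret = sum(r for (_, _, r, _) in episode)
        w_lo = w_hi = 1.0
        for t, (s, a, _, b_prob) in enumerate(episode):
            ratio = pi_e(s, a) / b_prob
            if t == confounded_step:
                # The true per-step importance ratio is only known to lie in
                # [ratio / gamma_conf, ratio * gamma_conf].
                w_lo *= ratio / gamma_conf
                w_hi *= ratio * gamma_conf
            else:
                w_lo *= ratio
                w_hi *= ratio
        # Worst and best episode contributions attainable with any weight in
        # [w_lo, w_hi]; the sign of the return decides which endpoint is which.
        lower.append(min(w_lo * ret, w_hi * ret))
        upper.append(max(w_lo * ret, w_hi * ret))
    return np.mean(lower), np.mean(upper)

With gamma_conf = 1 this collapses to the ordinary importance-sampling estimate; allowing even modest confounding at a single step widens the interval, which is the kind of sensitivity the abstract's certificates of robustness are meant to quantify.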


Published In

NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems
December 2020
22651 pages
ISBN: 9781713829546

Publisher

Curran Associates Inc.

Red Hook, NY, United States

