
Abstract

Q-learning, originally an incremental algorithm for estimating an optimal decision strategy in an infinite-horizon decision problem, now refers to a general class of reinforcement learning methods widely used in statistics and artificial intelligence. In the context of personalized medicine, finite-horizon Q-learning is the workhorse for estimating optimal treatment strategies, known as treatment regimes. Infinite-horizon Q-learning is also increasingly relevant in the growing field of mobile health. In computer science, Q-learning methods have achieved remarkable performance in domains such as game-playing and robotics. In this article, we (a) review the history of Q-learning in computer science and statistics, (b) formalize finite-horizon Q-learning within the potential outcomes framework and discuss the inferential difficulties for which it is infamous, and (c) review variants of infinite-horizon Q-learning and the exploration-exploitation problem, which arises in decision problems with a long time horizon. We close by discussing issues arising with the use of Q-learning in practice, including arguments for combining Q-learning with direct-search methods; sample size considerations for sequential, multiple assignment randomized trials; and possibilities for combining Q-learning with model-based methods.
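To make the "incremental algorithm" mentioned above concrete, the sketch below implements the classical tabular Q-learning update of Watkins & Dayan with epsilon-greedy exploration. It is a minimal illustration under stated assumptions, not the method developed in the article: the toy environment, the state and action counts, and the tuning constants (`alpha`, `gamma`, `epsilon`) are hypothetical choices made only for demonstration.

```python
import numpy as np

# Minimal tabular Q-learning sketch (Watkins & Dayan-style update).
# The environment, sizes, and constants below are illustrative assumptions.
n_states, n_actions = 10, 2
gamma, alpha, epsilon = 0.9, 0.1, 0.1   # discount, step size, exploration rate

rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))     # current estimates of Q(s, a)

def step(s, a):
    """Toy transition: next state is uniform; reward favors action 1 in high-indexed states."""
    s_next = int(rng.integers(n_states))
    reward = float(a == (s >= n_states // 2))
    return s_next, reward

s = int(rng.integers(n_states))
for t in range(10_000):
    # epsilon-greedy exploration: usually exploit the current estimate, occasionally act at random
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
    s_next, r = step(s, a)
    # incremental update toward the one-step Bellman target r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

greedy_policy = Q.argmax(axis=1)        # estimated optimal decision rule
```

Because the toy next state does not depend on the chosen action, the update here is driven mainly by the immediate reward; in a genuine sequential decision problem it is the max over next-state Q-values that propagates long-run value backward through time.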

