
Abstract

This article surveys reinforcement learning from the perspective of optimization and control, with a focus on continuous control applications. It reviews the general formulation, terminology, and typical experimental implementations of reinforcement learning as well as competing solution paradigms. In order to compare the relative merits of various techniques, it presents a case study of the linear quadratic regulator (LQR) with unknown dynamics, perhaps the simplest and best-studied problem in optimal control. It also describes how merging techniques from learning theory and control can provide nonasymptotic characterizations of LQR performance and shows that these characterizations tend to match experimental behavior. In turn, when more complex applications are revisited, many of the phenomena observed in LQR persist. In particular, theory and experiment demonstrate the role and importance of models and the cost of generality in reinforcement learning algorithms. The article concludes with a discussion of some of the challenges in designing learning systems that safely and reliably interact with complex and uncertain environments and how tools from reinforcement learning and control might be combined to approach these challenges.
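
For reference, the LQR case study mentioned above admits a compact statement. The following is a standard formulation of LQR with unknown dynamics (a conventional textbook statement, not text quoted from the article): choose control inputs u_t, as a function of the observed states, to solve

\[
\begin{aligned}
\text{minimize} \quad & \lim_{T \to \infty} \; \mathbb{E}_{w}\!\left[\frac{1}{T}\sum_{t=0}^{T}\left(x_t^\top Q\, x_t + u_t^\top R\, u_t\right)\right] \\
\text{subject to} \quad & x_{t+1} = A x_t + B u_t + w_t ,
\end{aligned}
\]

where x_t is the system state, u_t is the control input, w_t is i.i.d. process noise, and Q and R are known positive semidefinite and positive definite cost matrices, respectively. The learning problem arises because the dynamics matrices (A, B) are unknown and must be handled, explicitly through an estimated model or implicitly through a model-free method, using data gathered by interacting with the system.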
