Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies

Bin Hu; Kaiqing Zhang; Na Li; Mehran Mesbahi; Maryam Fazel; Tamer Başar

doi:10.1146/annurev-control-042920-020021

Annual Review of Control, Robotics, and Autonomous Systems

Volume 6, 2023

Review Article

Open Access

Bin Hu¹, Kaiqing Zhang²,³, Na Li⁴, Mehran Mesbahi⁵, Maryam Fazel⁶, and Tamer Başar¹

Affiliations:
¹Coordinated Science Laboratory and Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, Illinois, USA; email: binhu7@illinois.edu, basar1@illinois.edu
²Laboratory for Information and Decision Systems and Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
³Current affiliation: Department of Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, College Park, Maryland, USA; email: kaiqing@umd.edu
⁴School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, USA; email: nali@seas.harvard.edu
⁵Department of Aeronautics and Astronautics, University of Washington, Seattle, Washington, USA; email: mesbahi@uw.edu
⁶Department of Electrical and Computer Engineering, University of Washington, Seattle, Washington, USA; email: mfazel@uw.edu
Vol. 6:123-158 (Volume publication date May 2023) https://doi.org/10.1146/annurev-control-042920-020021
Copyright © 2023 by the author(s).

This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. See credit lines of images or other third-party material in this article for license information.

Abstract

Gradient-based methods have been widely used for system design and optimization in diverse application domains. Recently, there has been a renewed interest in studying theoretical properties of these methods in the context of control and reinforcement learning. This article surveys some of the recent developments on policy optimization, a gradient-based iterative approach for feedback control synthesis that has been popularized by successes of reinforcement learning. We take an interdisciplinary perspective in our exposition that connects control theory, reinforcement learning, and large-scale optimization. We review a number of recently developed theoretical results on the optimization landscape, global convergence, and sample complexity of gradient-based methods for various continuous control problems, such as the linear quadratic regulator (LQR), H_∞ control, risk-sensitive control, linear quadratic Gaussian (LQG) control, and output feedback synthesis. In conjunction with these optimization results, we also discuss how direct policy optimization handles stability and robustness concerns in learning-based control, two main desiderata in control engineering. We conclude the survey by pointing out several challenges and opportunities at the intersection of learning and control.
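
To make the idea of policy optimization concrete, the following is a minimal sketch for a scalar discrete-time LQR problem, not an implementation taken from the article. The system and cost parameters (`A`, `B`, `Q`, `R`), the initial gain, the step size, and all function names are illustrative assumptions. The sketch runs gradient descent directly on the feedback gain `k` of the policy u = -k x, using the closed-form cost that is available in the scalar case, and compares the result with the gain obtained from a Riccati recursion.

```python
# Scalar discrete-time LQR: x_{t+1} = A*x_t + B*u_t, cost = sum_t (Q*x_t^2 + R*u_t^2).
# All numbers below are illustrative assumptions, not values from the article.
A, B, Q, R = 1.2, 1.0, 1.0, 1.0  # the open loop is unstable because |A| > 1


def lqr_cost(k):
    """Infinite-horizon cost of u = -k*x from x_0 = 1 (finite only if |A - B*k| < 1)."""
    c = A - B * k
    if abs(c) >= 1.0:
        return float("inf")
    return (Q + R * k * k) / (1.0 - c * c)


def lqr_cost_grad(k):
    """Exact derivative of lqr_cost with respect to k on the stabilizing set."""
    c = A - B * k
    num = Q + R * k * k
    den = 1.0 - c * c
    return (2.0 * R * k * den - num * 2.0 * c * B) / (den * den)


def policy_gradient(k0, step=0.1, iters=200):
    """Gradient descent on the gain; the step is halved whenever the cost would increase."""
    k = k0
    for _ in range(iters):
        g = lqr_cost_grad(k)
        s = step
        # An infinite cost signals a destabilizing gain, so this also keeps k stabilizing.
        while lqr_cost(k - s * g) > lqr_cost(k) and s > 1e-12:
            s *= 0.5
        k -= s * g
    return k


def riccati_gain(iters=1000):
    """Reference optimal gain obtained by iterating the scalar Riccati recursion."""
    p = Q
    for _ in range(iters):
        p = Q + A * A * p - (A * B * p) ** 2 / (R + B * B * p)
    return A * B * p / (R + B * B * p)


k_pg = policy_gradient(k0=1.0)  # k0 = 1.0 is stabilizing: |A - B*k0| = 0.2
k_star = riccati_gain()
print(k_pg, k_star)
```

The initial gain must already be stabilizing, since the cost is infinite otherwise; the step-halving rule is one simple way, assumed here, to keep the iterates inside the stabilizing set.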

Keywords: feedback control synthesis, policy optimization, reinforcement learning

2023-05-03

2025-04-03

Literature Cited

1.
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J et al. 2015. Human-level control through deep reinforcement learning. Nature 518:529–33
[Google Scholar]
2.
Vinyals O, Babuschkin I, Chung J, Mathieu M, Jaderberg M et al. 2019. AlphaStar: mastering the real-time strategy game StarCraft II. DeepMind, Jan. 24. https://www.deepmind.com/blog/alphastar-mastering-the-real-time-strategy-game-starcraft-ii
[Google Scholar]
3.
Silver D, Huang A, Maddison CJ, Guez A, Sifre L et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529:484–89
[Google Scholar]
4.
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A et al. 2017. Mastering the game of Go without human knowledge. Nature 550:354–59
[Google Scholar]
5.
Rajeswaran A, Kumar V, Gupta A, Vezzani G, Schulman J et al. 2017. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv:1709.10087 [cs.LG]
6.
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T et al. 2015. Continuous control with deep reinforcement learning. arXiv:1509.02971 [cs.LG]
7.
Schulman J, Moritz P, Levine S, Jordan M, Abbeel P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438 [cs.LG]
8.
Sutton RS, McAllester DA, Singh SP, Mansour Y 2000. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems 12 S Solla, T Leen, K Müller 1057–63. Cambridge, MA: MIT Press
[Google Scholar]
9.
Konda VR, Tsitsiklis JN 2000. Actor-critic algorithms. Advances in Neural Information Processing Systems 12 S Solla, T Leen, K Müller 1008–14. Cambridge, MA: MIT Press
[Google Scholar]
10.
Schulman J, Levine S, Abbeel P, Jordan M, Moritz P 2015. Trust region policy optimization. Proceedings of the 32nd International Conference on Machine Learning F Bach, D Blei 1889–97. Proc. Mach. Learn. Res. 37 N.p.: PMLR
[Google Scholar]
11.
Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. 2017. Proximal policy optimization algorithms. arXiv:1707.06347 [cs.LG]
12.
Lee AX, Nagabandi A, Abbeel P, Levine S. 2019. Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv:1907.00953 [cs.LG]
13.
Yarats D, Zhang A, Kostrikov I, Amos B, Pineau J, Fergus R 2021. Improving sample efficiency in model-free reinforcement learning from images. Proc. AAAI Conf. Artif. Intell. 35:10674–81
[Google Scholar]
14.
Yarats D, Fergus R, Lazaric A, Pinto L. 2022. Mastering visual continuous control: improved data-augmented reinforcement learning. The Tenth International Conference on Learning Representations La Jolla, CA: Int. Conf. Learn. Represent. https://openreview.net/forum?id=_SJ-_yyes8
[Google Scholar]
15.
Draper CS, Li YT. 1951. Principles of Optimalizing Control Systems and an Application to the Internal Combustion Engine New York: Am. Soc. Mech. Eng.
[Google Scholar]
16.
Whitaker HP, Yamron J, Kezer A. 1958. Design of model-reference adaptive control systems for aircraft Rep., Instrum. Lab. Mass. Inst. Technol. Cambridge:
[Google Scholar]
17.
Kalman RE. 1960. Contributions to the theory of optimal control. Bol. Soc. Mat. Mex. 5:102–19
[Google Scholar]
18.
Talkin A. 1961. Adaptive servo tracking. IRE Trans. Autom. Control 6:167–72
[Google Scholar]
19.
Levine W, Athans M. 1970. On the determination of the optimal constant output feedback gains for linear multivariable systems. IEEE Trans. Autom. Control 15:44–48
[Google Scholar]
20.
Makila P, Toivonen H. 1987. Computational methods for parametric LQ problems—a survey. IEEE Trans. Autom. Control 32:658–71
[Google Scholar]
21.
Boyd S, Vandenberghe L. 2004. Convex Optimization Cambridge, UK: Cambridge Univ. Press
[Google Scholar]
22.
Boyd S, El Ghaoui L, Feron E, Balakrishnan V 1994. Linear Matrix Inequalities in System and Control Theory Philadelphia: Soc. Ind. Appl. Math.
[Google Scholar]
23.
Gahinet P, Apkarian P. 1994. A linear matrix inequality approach to H_∞ control. Int. J. Robust Nonlinear Control 4:421–48
[Google Scholar]
24.
Scherer C, Wieland S. 2004. Linear matrix inequalities in control Lect. Notes, Dutch Inst. Syst. Control, Delft Univ. Technol. Delft, Neth:.
[Google Scholar]
25.
Papachristodoulou A, Anderson J, Valmorbida G, Prajna S, Seiler P et al. 2022. SOSTOOLS: sum of squares optimization toolbox for MATLAB. University of Oxford http://sysos.eng.ox.ac.uk/sostools
[Google Scholar]
26.
Anderson J, Papachristodoulou A. 2015. Advances in computational Lyapunov analysis using sum-of-squares programming. Discrete Contin. Dyn. Syst. B 20:2361–81
[Google Scholar]
27.
Rautert T, Sachs EW. 1997. Computational design of optimal output feedback controllers. SIAM J. Optim. 7:837–52
[Google Scholar]
28.
Apkarian P, Noll D. 2006. Nonsmooth H_∞ synthesis. IEEE Trans. Autom. Control 51:71–86
[Google Scholar]
29.
Apkarian P, Noll D, Rondepierre A. 2008. Mixed H₂/H_∞ control via nonsmooth optimization. SIAM J. Control Optim. 47:1516–46
[Google Scholar]
30.
Noll D, Apkarian P. 2005. Spectral bundle methods for non-convex maximum eigenvalue functions: second-order methods. Math. Program. 104:729–47
[Google Scholar]
31.
Saeki M. 2006. Static output feedback design for H_∞ control by descent method. Proceedings of the 45th IEEE Conference on Decision and Control5156–61. Piscataway, NJ: IEEE
[Google Scholar]
32.
Gumussoy S, Henrion D, Millstone M, Overton ML. 2009. Multiobjective robust control with HIFOO 2.0. IFAC Proc. Vol 42:6144–49
[Google Scholar]
33.
Arzelier D, Deaconu G, Gumussoy S, Henrion D. 2011. H2 for HIFOO Paper presented at the 3rd International Conference on Control and Optimization with Industrial Applications Ankara, Turkey: Aug. 22–24
[Google Scholar]
34.
Mårtensson K, Rantzer A. 2009. Gradient methods for iterative distributed control synthesis. Proceedings of the 48th IEEE Conference on Decision and Control Held Jointly with 2009 28th Chinese Control Conference549–54. Piscataway, NJ: IEEE
[Google Scholar]
35.
Fazel M, Ge R, Kakade S, Mesbahi M 2018. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of the 35th International Conference on Machine Learning J Dy, A Krase 1467–76. Proc. Mach. Learn. Res. 80 N.p.: PMLR
[Google Scholar]
36.
Bu J, Mesbahi A, Fazel M, Mesbahi M. 2019. LQR through the lens of first order methods: discrete-time case. arXiv:1907.08921 [eess.SY]
37.
Malik D, Pananjady A, Bhatia K, Khamaru K, Bartlett P, Wainwright M 2019. Derivative-free methods for policy optimization: guarantees for linear quadratic systems. Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics K Chaudhuri, M Sugiyama 2916–25. Proc. Mach. Learn. Res. 89 N.p.: PMLR
[Google Scholar]
38.
Mohammadi H, Zare A, Soltanolkotabi M, Jovanović MR. 2021. Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem. IEEE Trans. Autom. Control 67:2435–50
[Google Scholar]
39.
Furieri L, Zheng Y, Kamgarpour M 2020. Learning the globally optimal distributed LQ regulator. Proceedings of the 2nd Conference on Learning for Dynamics and Control AM Bayen, A Jadbabaie, G Pappas, PA Parrilo, B Recht, et al. 287–97. Proc. Mach. Learn. Res. 120 N.p.: PMLR
[Google Scholar]
40.
Li Y, Tang Y, Zhang R, Li N. 2022. Distributed reinforcement learning for decentralized linear quadratic control: a derivative-free policy optimization approach. IEEE Trans. Autom. Control 67:6429–44
[Google Scholar]
41.
Hambly B, Xu R, Yang H. 2021. Policy gradient methods for the noisy linear quadratic regulator over a finite horizon. SIAM J. Control Optim. 59:3359–91
[Google Scholar]
42.
Yang Z, Chen Y, Hong M, Wang Z 2019. Provably global convergence of actor-critic: a case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems 32 H Wallach, H Larochelle, A Beygelzimer, F d'Alché-Buc, E Fox, R Garnett 8231–33. Red Hook, NY: Curran
[Google Scholar]
43.
Jin Z, Schmitt JM, Wen Z. 2020. On the analysis of model-free methods for the linear quadratic regulator. arXiv:2007.03861 [math.OC]
44.
Mohammadi H, Soltanolkotabi M, Jovanović MR. 2020. On the linear convergence of random search for discrete-time LQR. IEEE Control Syst. Lett. 5:989–94
[Google Scholar]
45.
Perdomo J, Umenberger J, Simchowitz M 2021. Stabilizing dynamical systems via policy gradient methods. In Advances in Neural Information Processing Systems 34 M Ranzato, A Beygelzimer, Y Dauphin, PS Liang, J Wortman Vaughan 29274–86. Red Hook, NY: Curran
[Google Scholar]
46.
Ozaslan IK, Mohammadi H, Jovanović MR. 2022. Computing stabilizing feedback gains via a model-free policy gradient method. IEEE Control Syst. Lett. 7:407–12
[Google Scholar]
47.
Zhao F, Fu X, You K. 2022. On the sample complexity of stabilizing linear systems via policy gradient methods. arXiv:2205.14335 [math.OC]
48.
Zhang K, Hu B, Başar T. 2021. Policy optimization for H₂ linear control with H_∞ robustness guarantee: implicit regularization and global convergence. SIAM J. Control Optim. 59:4081–109
[Google Scholar]
49.
Zhang K, Hu B, Başar T 2020. On the stability and convergence of robust adversarial reinforcement learning: a case study on linear quadratic systems. Advances in Neural Information Processing Systems 33 H Larochelle, M Ranzato, R Hadsell, MF Balcan, H Lin 22056–68. Red Hook, NY: Curran
[Google Scholar]
50.
Gravell B, Esfahani PM, Summers T. 2020. Learning optimal controllers for linear systems with multiplicative noise via policy gradient. IEEE Trans. Autom. Control 66:5283–98
[Google Scholar]
51.
Zhang K, Zhang X, Hu B, Başar T 2021. Derivative-free policy optimization for linear risk-sensitive and robust control design: implicit regularization and sample complexity. Advances in Neural Information Processing Systems 34 M Ranzato, A Beygelzimer, Y Dauphin, PS Liang, J Wortman Vaughan 2949–64. Red Hook, NY: Curran
[Google Scholar]
52.
Zhao F, You K 2021. Primal-dual learning for the model-free risk-constrained linear quadratic regulator. Proceedings of the 3rd Conference on Learning for Dynamics and Control A Jadbabaie, J Lygeros, GJ Pappas, PA Parrilo, B Recht, et al. 702–14. Proc. Mach. Learn. Res. 144 N.p.: PMLR
[Google Scholar]
53.
Zhang Y, Yang Z, Wang Z 2021. Provably efficient actor-critic for risk-sensitive and robust adversarial RL: a linear-quadratic case. Proceedings of the 24th International Conference on Artificial Intelligence and Statistics A Banerjee, K Fukumizu 2764–72. Proc. Mach. Learn. Res. 130 N.p.: PMLR
[Google Scholar]
54.
Guo X, Hu B. 2022. Global convergence of direct policy search for state-feedback H_∞robust control: a revisit of nonsmooth synthesis with Goldstein subdifferential Paper presented at the 36th Conference on Neural Information Processing Systems New Orleans, LA: Nov. 28–Dec. 9
[Google Scholar]
55.
Keivan D, Havens A, Seiler P, Dullerud G, Hu B. 2022. Model-free μ synthesis via adversarial reinforcement learning. 2022 American Control Conference3335–41. Piscataway, NJ: IEEE
[Google Scholar]
56.
Jansch-Porto JP, Hu B, Dullerud GE. 2020. Convergence guarantees of policy optimization methods for Markovian jump linear systems. 2020 American Control Conference2882–87. Piscataway, NJ: IEEE
[Google Scholar]
57.
Jansch-Porto JP, Hu B, Dullerud G 2020. Policy learning of MDPs with mixed continuous/discrete variables: a case study on model-free control of Markovian jump systems. Proceedings of the 2nd Conference on Learning for Dynamics and Control AM Bayen, A Jadbabaie, G Pappas, PA Parrilo, B Recht, et al. 947–957. Proc. Mach. Learn. Res. 120 N.p.: PMLR
[Google Scholar]
58.
Rathod S, Bhadu M, De A. 2021. Global convergence using policy gradient methods for model-free Markovian jump linear quadratic control. arXiv:2111.15228 [cs.LG]
59.
Jansch-Porto JP, Hu B, Dullerud GE. 2022. Policy optimization for Markovian jump linear quadratic control: gradient method and global convergence. IEEE Trans. Autom. Control In press. https://doi.org/10.1109/TAC.2022.3176439
[Google Scholar]
60.
Qu G, Yu C, Low S, Wierman A. 2021. Exploiting linear models for model-free nonlinear control: a provably convergent policy gradient approach. 2021 60th IEEE Conference on Decision and Control6539–46. Piscataway, NJ: IEEE
[Google Scholar]
61.
Feng H, Lavaei J. 2019. On the exponential number of connected components for the feasible set of optimal decentralized control problems. 2019 American Control Conference1430–37. Piscataway, NJ: IEEE
[Google Scholar]
62.
Fatkhullin I, Polyak B. 2021. Optimizing static linear feedback: gradient method. SIAM J. Control Optim. 59:3887–911
[Google Scholar]
63.
Zheng Y, Tang Y, Li N. 2021. Analysis of the optimization landscape of linear quadratic Gaussian (LQG) control. arXiv:2102.04393 [math.OC]
64.
Duan J, Li J, Zhao L. 2021. Optimization landscape of gradient descent for discrete-time static output feedback. arXiv:2109.13132 [math.OC]
65.
Duan J, Cao W, Zheng Y, Zhao L. 2022. On the optimization landscape of dynamical output feedback linear quadratic control. arXiv:2201.09598 [math.OC]
66.
Mohammadi H, Soltanolkotabi M, Jovanović MR. 2021. On the lack of gradient domination for linear quadratic Gaussian problems with incomplete state information. 2021 60th IEEE Conference on Decision and Control1120–24. Piscataway, NJ: IEEE
[Google Scholar]
67.
Hu B, Zheng Y. 2022. Connectivity of the feasible and sublevel sets of dynamic output feedback control with robustness constraints. IEEE Control Syst. Lett. 7:442–47
[Google Scholar]
68.
Umenberger J, Simchowitz M, Perdomo JC, Zhang K, Tedrake R. 2022. Globally convergent policy search over dynamic filters for output estimation. arXiv:2202.11659 [math.OC]
69.
Buşoniu L, de Bruin T, Tolić D, Kober J, Palunko I. 2018. Reinforcement learning for control: performance, stability, and deep approximators. Annu. Rev. Control 46:8–28
[Google Scholar]
70.
Recht B. 2019. A tour of reinforcement learning: the view from continuous control. Annu. Rev. Control Robot. Auton. Syst. 2:253–79
[Google Scholar]
71.
Matni N, Proutiere A, Rantzer A, Tu S. 2019. From self-tuning regulators to reinforcement learning and back again. 2019 IEEE 58th Conference on Decision and Control3724–40. Piscataway, NJ: IEEE
[Google Scholar]
72.
Jacobson D. 1973. Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Trans. Autom. Control 18:124–31
[Google Scholar]
73.
Zhou K, Doyle JC, Glover K. 1996. Robust and Optimal Control Upper Saddle River, NJ: Prentice Hall
[Google Scholar]
74.
Mustafa D. 1989. Relations between maximum-entropy/H_∞ control and combined H_∞ /LQG control. Syst. Control Lett. 12:193–203
[Google Scholar]
75.
Mustafa D, Bernstein DS. 1991. LQG cost bounds in discrete-time H₂/H_∞ control. Trans. Inst. Meas. Control 13:269–75
[Google Scholar]
76.
Peres PL, Geromel JC. 1994. An alternate numerical solution to the linear quadratic problem. IEEE Trans. Autom. Control 39:198–202
[Google Scholar]
77.
Nesterov Y, Polyak BT. 2006. Cubic regularization of Newton method and its global performance. Math. Program. 108:177–205
[Google Scholar]
78.
Karimi H, Nutini J, Schmidt M 2016. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Machine Learning and Knowledge Discovery in Databases P Frasconi, N Landwehr, G Manco, J Vreeken 795–811. Cham, Switz.: Springer
[Google Scholar]
79.
Li G, Pong TK. 2018. Calculus of the exponent of Kurdyka–Lojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. 18:1199–232
[Google Scholar]
80.
Bauschke HH, Combettes PL. 2011. Convex Analysis and Monotone Operator Theory in Hilbert Spaces Cham, Switz.: Springer
[Google Scholar]
81.
Mohammadi H, Zare A, Soltanolkotabi M, Jovanović MR. 2019. Global exponential convergence of gradient methods over the nonconvex landscape of the linear quadratic regulator. 2019 58th IEEE Conference on Decision and Control7474–79. Piscataway, NJ: IEEE
[Google Scholar]
82.
Bu J, Mesbahi M. 2020. Global convergence of policy gradient algorithms for indefinite least squares stationary optimal control. IEEE Control Syst. Lett. 4:638–43
[Google Scholar]
83.
Lamperski A. 2020. Computing stabilizing linear controllers via policy iteration. 2020 59th IEEE Conference on Decision and Control1902–07. Piscataway, NJ: IEEE
[Google Scholar]
84.
Nesterov Y, Spokoiny V. 2017. Random gradient-free minimization of convex functions. Found. Comput. Math. 17:527–66
[Google Scholar]
85.
Duchi JC, Jordan MI, Wainwright MJ, Wibisono A. 2015. Optimal rates for zero-order convex optimization: the power of two function evaluations. IEEE Trans. Inf. Theory 61:2788–806
[Google Scholar]
86.
Shamir O 2013. On the complexity of bandit and derivative-free stochastic convex optimization. Proceedings of the 26th Annual Conference on Learning Theory S Shalev-Shwartz, I Steinwart 3–24. Proc. Mach. Learn. Res. 30 N.p.: PMLR
[Google Scholar]
87.
Ghadimi S, Lan G. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23:2341–68
[Google Scholar]
88.
Balasubramanian K, Ghadimi S. 2022. Zeroth-order nonconvex stochastic optimization: handling constraints, high dimensionality, and saddle points. Found. Comput. Math. 22:35–76
[Google Scholar]
89.
Tang Y, Ren Z, Li N. 2020. Zeroth-order feedback optimization for cooperative multi-agent systems. 2020 59th IEEE Conference on Decision and Control3649–56. Piscataway, NJ: IEEE
[Google Scholar]
90.
Talebi S, Alemzadeh S, Rahimi N, Mesbahi M. 2021. On regularizability and its application to online control of unstable LTI systems. IEEE Trans. Autom. Control 67:6413–28
[Google Scholar]
91.
Ziemann I, Tsiamis A, Sandberg H, Matni N. 2022. How are policy gradient methods affected by the limits of control?. arXiv:2206.06863 [math.OC]
92.
Tsiamis A, Ziemann I, Matni N, Pappas GJ. 2022. Statistical learning theory for control: a finite sample perspective. arXiv:2209.05423 [eess.SY]
93.
Hjalmarsson H, Gevers M, Gunnarsson S, Lequin O. 1998. Iterative feedback tuning: theory and applications. IEEE Control Syst. Mag. 18:426–41
[Google Scholar]
94.
Hjalmarsson H. 2002. Iterative feedback tuning—an overview. Int. J. Adapt. Control Signal Process. 16:373–95
[Google Scholar]
95.
Kakade SM 2002. A natural policy gradient. Advances in Neural Information Processing Systems 14 T Dietterich, S Becker, Z Ghahramani 1531–38. Cambridge, MA: MIT Press
[Google Scholar]
96.
Bradtke SJ, Ydstie BE, Barto AG. 1994. Adaptive linear quadratic control using policy iteration. Proceedings of the 1994 American Control Conference, Vol. 33475–79. Piscataway, NJ: IEEE
[Google Scholar]
97.
Kleinman D. 1968. On an iterative technique for Riccati equation computations. IEEE Trans. Autom. Control 13:114–15
[Google Scholar]
98.
Hewer G. 1971. An iterative technique for the computation of the steady state gains for the discrete optimal regulator. IEEE Trans. Autom. Control 16:382–84
[Google Scholar]
99.
Lagoudakis MG, Parr R. 2003. Least-squares policy iteration. J. Mach. Learn. Res. 4:1107–49
[Google Scholar]
100.
Krauth K, Tu S, Recht B 2019. Finite-time analysis of approximate policy iteration for the linear quadratic regulator. Advances in Neural Information Processing Systems 32 H Wallach, H Larochelle, A Beygelzimer, F d'Alché-Buc, E Fox, R Garnett 8514–24. Red Hook, NY: Curran
[Google Scholar]
101.
Bu J, Mesbahi A, Mesbahi M. 2020. Policy gradient-based algorithms for continuous-time linear quadratic control. arXiv:2006.09178 [eess.SY]
102.
Bertsekas DP. 1997. Nonlinear programming. J. Oper. Res. Soc. 48:334
[Google Scholar]
103.
Burke JV, Curtis FE, Lewis AS, Overton ML, Simões LE 2020. Gradient sampling methods for nonsmooth optimization. Numerical Nonsmooth Optimization A Bagirov, M Gaudioso, N Karmitsa, M Mäkelä, S Taheri 202–25. Cham, Switz.: Springer
[Google Scholar]
104.
Wu HN, Luo B. 2012. Neural network based online simultaneous policy update algorithm for solving the HJI equation in nonlinear H_∞ control. IEEE Trans. Neural Netw. Learn. Syst. 23:1884–95
[Google Scholar]
105.
Luo B, Wu HN, Huang T. 2014. Off-policy reinforcement learning for H_∞ control design. IEEE Trans. Cybernet. 45:65–76
[Google Scholar]
106.
Kiumarsi B, Lewis FL, Jiang ZP. 2017. H_∞ control of linear discrete-time systems: off-policy reinforcement learning. Automatica 78:144–52
[Google Scholar]
107.
Kubo M, Banno R, Manabe H, Minoji M. 2019. Implicit regularization in over-parameterized neural networks. arXiv:1903.01997 [cs.LG]
108.
Ma C, Wang K, Chi Y, Chen Y 2017. Implicit regularization in nonconvex statistical estimation: gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. arXiv:1711.10467 [cs.LG]
109.
Chen Y, Wainwright MJ. 2015. Fast low-rank estimation by projected gradient descent: general statistical and algorithmic guarantees. arXiv:1509.03025 [math.ST]
110.
Zheng Q, Lafferty J. 2016. Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent. arXiv:1605.07051 [stat.ML]
111.
Başar T, Bernhard P. 1995. H^∞-Optimal Control and Related Minimax Design Problems Boston: Birkhäuser. , 2nd ed..
[Google Scholar]
112.
Rantzer A. 1996. On the Kalman–Yakubovich–Popov lemma. Syst. Control Lett. 28:7–10
[Google Scholar]
113.
Al-Tamimi A, Lewis FL, Abu-Khalaf M. 2007. Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica 43:473–81
[Google Scholar]
114.
Zhang K, Yang Z, Başar T 2019. Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. Advances in Neural Information Processing Systems 32 H Wallach, H Larochelle, A Beygelzimer, F d'Alché-Buc, E Fox, R Garnett 11570–82. Red Hook, NY: Curran
[Google Scholar]
115.
Gravell B, Ganapathy K, Summers T. 2020. Policy iteration for linear quadratic games with stochastic parameters. IEEE Control Syst. Lett. 5:307–12
[Google Scholar]
116.
Zhang J, Yang Z, Zhou Z, Wang Z 2021. Provably sample efficient reinforcement learning in competitive linear quadratic systems. Proceedings of the 3rd Conference on Learning for Dynamics and Control A Jadbabaie, J Lygeros, GJ Pappas, PA Parrilo, B Recht, et al. 597–98. Proc. Mach. Learn. Res. 144 N.p.: PMLR
[Google Scholar]
117.
Zhang K, Yang Z, Başar T 2021. Multi-agent reinforcement learning: a selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control KG Vamvoudakis, Y Wan, FL Lewis, D Cansever 321–84. Cham, Switz: Springer
[Google Scholar]
118.
Başar T, Olsder G. 1999. Dynamic Noncooperative Game Theory Philadelphia: Soc. Ind. Appl. Math., 2nd ed..
[Google Scholar]
119.
Bu J, Ratliff LJ, Mesbahi M. 2019. Global convergence of policy gradient for sequential zero-sum linear quadratic dynamic games. arXiv:1911.04672 [eess.SY]
120.
Morimoto J, Doya K. 2005. Robust reinforcement learning. Neural Comput. 17:335–59
[Google Scholar]
121.
Pinto L, Davidson J, Sukthankar R, Gupta A 2017. Robust adversarial reinforcement learning. Proceedings of the 34th International Conference on Machine Learning D Precup, YW Teh 2817–26. Proc. Mach. Learn. Res. 70 N.p.: PMLR
[Google Scholar]
122.
Dullerud G, Paganini F. 1999. A Course in Robust Control Theory: A Convex Approach New York: Springer
[Google Scholar]
123.
Goldstein A. 1977. Optimization of Lipschitz continuous functions. Math. Program. 13:14–22
[Google Scholar]
124.
Turchetta M, Krause A, Trimpe S. 2020. Robust model-free reinforcement learning with multi-objective Bayesian optimization. 2020 IEEE International Conference on Robotics and Automation10702–8. Piscataway, NJ: IEEE
[Google Scholar]
125.
Pang B, Jiang ZP. 2021. Robust reinforcement learning: a case study in linear quadratic regulation. Proc. AAAI Conf. Artif. Intell. 35:9303–11
[Google Scholar]
126.
Pang B, Bian T, Jiang ZP. 2021. Robust policy iteration for continuous-time linear quadratic regulation. IEEE Trans. Autom. Control 67:504–11
[Google Scholar]
127.
Venkataraman HK, Seiler PJ. 2019. Recovering robustness in model-free reinforcement learning. 2019 American Control Conference4210–16. Piscataway, NJ: IEEE
[Google Scholar]
128.
Zhao F, You K, Başar T. 2021. Infinite-horizon risk-constrained linear quadratic regulator with average cost. 2021 60th IEEE Conference on Decision and Control390–95. Piscataway, NJ: IEEE
[Google Scholar]
129.
Zhao F, You K, Başar T. 2021. Global convergence of policy gradient primal-dual methods for risk-constrained LQRs. arXiv:2104.04901 [math.OC]
130.
Zheng Y, Sun Y, Fazel M, Li N. 2022. Escaping high-order saddles in policy optimization for linear quadratic Gaussian (LQG) control. arXiv:2204.00912 [math.OC]
131.
Bu J, Mesbahi A, Mesbahi M. 2021. On topological properties of the set of stabilizing feedback gains. IEEE Trans. Autom. Control 66730–44
[Google Scholar]
132.
Talebi S, Mesbahi M. 2022. Policy optimization over submanifolds for constrained feedback synthesis. IEEE Trans. Autom. Control. In press
[Google Scholar]
133.
Ding Y, Feng H, Lavaei J. 2019. Aggressive local search for constrained optimal control problems with many local minima. arXiv:1903.08634 [math.OC]
134.
Sun Y, Fazel M. 2021. Learning optimal controllers by policy gradient: global optimality via convex parameterization. 2021 60th IEEE Conference on Decision and Control4576–81. Piscataway, NJ: IEEE
[Google Scholar]
135.
Scherer C, Gahinet P, Chilali M. 1997. Multiobjective output-feedback control via LMI optimization. IEEE Trans. Autom. Control 42:896–911
[Google Scholar]
136.
Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI. 2017. How to escape saddle points efficiently. Proceedings of the 34th International Conference on Machine Learning D Precup, YW Teh 1724–32. Proc. Mach. Learn. Res. 70 N.p.: PMLR
[Google Scholar]
137.
Sun Y, Flammarion N, Fazel M 2019. Escaping from saddle points on Riemannian manifolds. Advances in Neural Information Processing Systems 32 H Wallach, H Larochelle, A Beygelzimer, F d'Alché-Buc, E Fox, R Garnett 7244–54. Red Hook, NY: Curran
[Google Scholar]
138.
Ren Z, Tang Y, Li N. 2022. Escaping saddle points in zeroth-order optimization: two function evaluations suffice. arXiv:2209.13555 [math.OC]
139.
van der Schaft A 2000. L₂-Gain and Passivity Techniques in Nonlinear Control London: Springer
[Google Scholar]
140.
Willems J. 1972. Dissipative dynamical systems part I: general theory. Arch. Ration. Mech. Anal. 45:321–51
[Google Scholar]
141.
Megretski A, Rantzer A. 1997. System analysis via integral quadratic constraints. IEEE Trans. Autom. Control 42:819–30
[Google Scholar]
142.
Sandell N, Varaiya P, Athans M. 1975. A survey of decentralized control methods for large scale systems. Systems Engineering for Power: Status and Prospects334–35. Washington, DC: US Energy Res. Dev. Adm.
[Google Scholar]
143.
Tsitsiklis JN. 1984. Problems in decentralized decision making and computation PhD Thesis Mass. Inst. Technol. Cambridge:
[Google Scholar]
144.
Rotkowitz M, Lall S. 2005. A characterization of convex problems in decentralized control. IEEE Trans. Autom. Control 50:1984–96
[Google Scholar]
145.
Mazumdar E, Ratliff LJ, Jordan MI, Sastry SS. 2019. Policy-gradient algorithms have no guarantees of convergence in linear quadratic games. arXiv:1907.03712 [cs.LG]
146.
Fu Z, Yang Z, Chen Y, Wang Z 2019. Actor-critic provably finds Nash equilibria of linear-quadratic mean-field games. The Eighth International Conference on Learning Representations La Jolla, CA: Int. Conf. Learn. Represent https://openreview.net/forum?id=H1lhqpEYPr
[Google Scholar]
147.
Carmona R, Laurière M, Tan Z. 2019. Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv:1910.04295 [math.OC]
148.
Wang W, Han J, Yang Z, Wang Z 2021. Global convergence of policy gradient for linear-quadratic mean-field control/game in continuous time. Proceedings of the 38th International Conference on Machine Learning M Meila, T Zhang 10772–82. Proc. Mach. Learn. Res. 139 N.p.: PMLR
[Google Scholar]
149.
Carmona R, Hamidouche K, Laurière M, Tan Z. 2020. Policy optimization for linear-quadratic zero-sum mean-field type games. 2020 Conference on Decision and Control1038–43. Piscataway, NJ: IEEE
[Google Scholar]
150.
Dean S, Mania H, Matni N, Recht B, Tu S. 2017. On the sample complexity of the linear quadratic regulator. Found. Comput. Math. 20:633–79
151.
Tu S, Recht B. 2019. The gap between model-based and model-free methods on the linear quadratic regulator: an asymptotic viewpoint. In Proceedings of the Thirty-Second Conference on Learning Theory, ed. A Beygelzimer, D Hsu, pp. 3036–83. Proc. Mach. Learn. Res. 99. N.p.: PMLR
152.
Lale S, Azizzadenesheli K, Hassibi B, Anandkumar A. 2020. Explore more and improve regret in linear quadratic regulators. arXiv:2007.12291 [cs.LG]
153.
Chen X, Hazan E. 2021. Black-box control for linear dynamical systems. In Proceedings of Thirty Fourth Conference on Learning Theory, ed. M Belkin, S Kpotufe, pp. 1114–43. Proc. Mach. Learn. Res. 134. N.p.: PMLR
154.
Simchowitz M, Foster D. 2020. Naive exploration is optimal for online LQR. In Proceedings of the 37th International Conference on Machine Learning, ed. H Daumé III, A Singh, pp. 8937–48. Proc. Mach. Learn. Res. 119. N.p.: PMLR
155.
Simchowitz M, Singh K, Hazan E. 2020. Improper learning for non-stochastic control. In Proceedings of 33rd Conference on Learning Theory, ed. J Abernethy, S Agarwal, pp. 3320–36. Proc. Mach. Learn. Res. 125. N.p.: PMLR
156.
Agarwal N, Bullins B, Hazan E, Kakade S, Singh K. 2019. Online control with adversarial disturbances. In Proceedings of the 36th International Conference on Machine Learning, ed. K Chaudhuri, R Salakhutdinov, pp. 111–19. Proc. Mach. Learn. Res. 97. N.p.: PMLR
157.
Palan M, Barratt S, McCauley A, Sadigh D, Sindhwani V, Boyd S. 2020. Fitting a linear control policy to demonstrations with a Kalman constraint. In Proceedings of the 2nd Conference on Learning for Dynamics and Control, ed. AM Bayen, A Jadbabaie, G Pappas, PA Parrilo, B Recht, et al., pp. 374–83. Proc. Mach. Learn. Res. 120. N.p.: PMLR
158.
Havens A, Hu B. 2021. On imitation learning of linear control policies: enforcing stability and robustness constraints via LMI conditions. In 2021 American Control Conference, pp. 882–87. Piscataway, NJ: IEEE
159.
Yin H, Seiler P, Jin M, Arcak M. 2021. Imitation learning with stability and safety guarantees. IEEE Control Syst. Lett. 6:409–14
160.
Tu S, Robey A, Zhang T, Matni N. 2022. On the sample complexity of stability constrained imitation learning. In Proceedings of the 4th Annual Learning for Dynamics and Control Conference, ed. R Firoozi, N Mehr, E Yel, R Antonova, J Bohg, et al., pp. 180–91. Proc. Mach. Learn. Res. 168. N.p.: PMLR
161.
Molybog I, Lavaei J. 2021. When does MAML objective have benign landscape? In 2021 IEEE Conference on Control Technology and Applications, pp. 220–27. Piscataway, NJ: IEEE

Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies

Annual Review of Control, Robotics, and Autonomous Systems 6, 123 (2023); https://doi.org/10.1146/annurev-control-042920-020021
