Agile Robotics: Optimal Control, Reinforcement Learning, and Differentiable Simulation

Yunlong Song and Davide Scaramuzza
Robotics and Perception Group, University of Zurich

Continuous-Time Optimal Control Problem. Minimize a cost function over a time horizon:

$$\min_{x(\cdot),\,u(\cdot)} \int_{0}^{T} \ell(x(t),u(t),t)\,dt + \ell(x(T))$$

| Control Method | Model Predictive Control | Policy Search | Backpropagation Through Time |
|---|---|---|---|
| Optimization Objective | $J(x,u)=\sum_{k=0}^{N-1}\ell(x_k,u_k)+\ell(x_N)$ | $J(\theta)=\mathbb{E}_{\tau\sim\pi_{\theta}}\left[\sum_{k=0}^{N} r_k\right]$ | $J(\theta)=\mathbb{E}_{x_0\sim p(x_0)}\left[\sum_{k=0}^{N-1}\ell(x_k,u_k)+\ell(x_N)\right]$ |
| Constraints | s.t. $x_0=x_{\text{init}}$, $x_{k+1}=f(x_k,u_k)$, $g(x,u)=0$, $h(x,u)\leq 0$ | - | - |
| Decision Variables | $u_k, x_k$ | $\theta$ | $\theta$ |
| Optimization Method | Nonlinear Programming | Policy Gradient | Analytical Gradient |
| Control Law | $u_0^{\ast}$ | $u\sim\pi_{\theta^{\ast}}(u \mid x)$ | $u=\pi_{\theta^{\ast}}(x)$ |

TABLE I: Comparison of three methods for approximately solving the continuous-time optimal control problem.
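To make the MPC column of Table I concrete, the following Python sketch runs a receding-horizon loop on a scalar linear system. It is an illustration under strong assumptions, not an implementation from the cited works: the dynamics $x_{k+1} = a x_k + b u_k$, the quadratic stage cost $q x_k^2 + r u_k^2$, and all numerical values are hypothetical, and the constraints $g$ and $h$ are dropped so that a backward Riccati recursion can stand in for the nonlinear programming solver. At every step the finite-horizon problem is re-solved from the current state and only the first input $u_0^{\ast}$ is applied.

```python
def first_step_gain(a, b, q, r, q_terminal, horizon):
    """Backward Riccati recursion for x_{k+1} = a*x_k + b*u_k with stage cost
    q*x^2 + r*u^2 and terminal cost q_terminal*x^2.
    Returns K_0 such that the optimal first input is u_0* = -K_0 * x_0."""
    p = q_terminal          # cost-to-go weight at the end of the horizon
    gain = 0.0
    for _ in range(horizon):
        gain = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * gain
    return gain


def run_mpc(x0, steps, a=1.0, b=0.1, q=1.0, r=0.1, q_terminal=1.0, horizon=20):
    """Receding-horizon loop: re-solve at every step, apply only u_0*."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        k0 = first_step_gain(a, b, q, r, q_terminal, horizon)  # "solve" the problem
        u0 = -k0 * x                                           # control law u_0*
        x = a * x + b * u0                                     # plant update
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    states = run_mpc(x0=1.0, steps=50)
    print("final state:", states[-1])
```

In this unconstrained linear-quadratic case the re-solve yields the same gain at every step, so the loop is equivalent to a fixed feedback law. With nonlinear dynamics or active constraints, the solve inside the loop would instead be a numerical optimization over $u_k, x_k$.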

Control systems are at the core of every real-world robot. They are deployed in an ever-increasing number of applications, ranging from autonomous racing and search-and-rescue missions to industrial inspections and space exploration. To achieve peak performance, certain tasks require pushing the robot to its maximum agility. How can we design control algorithms that enhance the agility of autonomous robots and maintain robustness against unforeseen disturbances? My research addresses this question by leveraging fundamental principles in optimal control, reinforcement learning, and differentiable simulation.

Optimal control [3, 4], for example Model Predictive Control (MPC), relies on an accurate mathematical model embedded in an optimization framework and solves complex optimization problems online. Reinforcement Learning (RL) [26] optimizes a control policy to maximize a reward signal through trial and error. Differentiable simulation [25, 7] promises better convergence and sample efficiency than RL by replacing zeroth-order gradient estimates of a stochastic objective with estimates based on first-order gradients. Table I summarizes these three approaches.
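The difference between the two kinds of gradient estimate can be seen on a one-dimensional toy objective. The sketch below is an illustration only: the objective $J(\theta)=\mathbb{E}_{\varepsilon\sim\mathcal{N}(0,1)}[(\theta+\sigma\varepsilon)^2]$, the function names, and the parameter values are assumptions chosen so that the true gradient, $2\theta$, is known. The zeroth-order estimate uses only cost values (the score-function form underlying policy gradients), whereas the first-order estimate differentiates the cost itself.

```python
import random
import statistics


def cost(u):
    return u * u


def cost_grad(u):
    return 2.0 * u


def gradient_samples(theta, sigma, n, seed=0):
    """Draw n single-sample estimates of dJ/dtheta with both estimators."""
    rng = random.Random(seed)
    zeroth, first = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        u = theta + sigma * eps
        # Zeroth-order (score function): cost value times d/dtheta log N(u; theta, sigma^2).
        zeroth.append(cost(u) * eps / sigma)
        # First-order (pathwise): derivative of the cost along the sampled path.
        first.append(cost_grad(u))
    return zeroth, first


if __name__ == "__main__":
    z, f = gradient_samples(theta=1.0, sigma=0.1, n=20000)
    print("zeroth-order mean/variance:", statistics.fmean(z), statistics.variance(z))
    print("first-order  mean/variance:", statistics.fmean(f), statistics.variance(f))
```

Both estimators are unbiased for this smooth objective, but the zeroth-order one has a much larger variance. As [25] discusses, this advantage of first-order gradients is not guaranteed when the dynamics are stiff or discontinuous.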

In particular, model-free RL has recently achieved impressive results in various domains, such as autonomous drone racing [22, 23, 13] and quadrupedal locomotion over challenging terrain [12, 14, 15]. Some of the most impressive achievements of RL are beyond the reach of existing optimal control (OC) systems. However, most studies focus on system design; less attention has been paid to the systematic study of the fundamental factors that have led to the success of RL or have limited OC.

It is important to highlight that progress in applying RL to robot control has been driven primarily by the computational capabilities of GPUs rather than by algorithmic breakthroughs. Consequently, in scenarios where data collection cannot be accelerated through computational means, researchers may resort to alternative strategies such as imitation learning [5] to circumvent this limitation [1, 10, 6, 28]. This highlights the need to study the connection between RL, optimal control, and robot dynamics. I attempt to answer the following three research questions:

Research Question 1: What are the intrinsic benefits of reinforcement learning compared to optimal control?

Research Question 2: How can we combine the advantages of reinforcement learning and optimal control?

Research Question 3: How can we effectively leverage the dynamics of robots to improve policy training?

A. Reinforcement Learning versus Optimal Control

Figure 1: RL outperforms optimal control in drone racing [23].

In [23], we investigate Research Question 1 by studying RL and OC from the perspective of the optimization method and the optimization objective. We carry out the investigation on a challenging real-world problem that involves a high-performance robotic system: autonomous drone racing.

On the one hand, RL and OC are two different optimization methods, and we can ask which method achieves a more robust solution given the same cost function. On the other hand, given that RL and OC address a given robot control problem by optimizing different objectives, we can ask which optimization objective leads to more robust task performance.

Our results indicate that the fundamental advantage of RL over OC lies in its optimization objective. Specifically, RL directly maximizes a task-level objective, which leads to more robust control performance in the presence of unmodeled dynamics and disturbances. In contrast, OC is limited by the requirement of optimizing a smooth and differentiable loss function, which in turn requires decomposing the task into planning and control, thus limiting the range of control policies that the system can express. In addition, RL can leverage domain randomization, in which the agent is trained on a variety of simulated environments with varying settings, to gain extra robustness and avoid overfitting.
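Domain randomization can be sketched in a few lines of Python. The example below is a toy under stated assumptions and not the randomization scheme used in [23]: the one-dimensional vehicle model, the `SimParams` fields, the nominal values, and the ±20% range are all hypothetical. Each episode draws its own simulation parameters, and the quantity being evaluated is the return averaged over those draws rather than the return on a single nominal model.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    mass: float              # hypothetical vehicle mass
    drag_coefficient: float  # hypothetical linear drag coefficient


NOMINAL = SimParams(mass=1.0, drag_coefficient=0.3)


def sample_randomized_params(rng, nominal=NOMINAL, relative_range=0.2):
    """Scale every nominal parameter by an independent uniform factor."""
    def perturb(value):
        return value * rng.uniform(1.0 - relative_range, 1.0 + relative_range)
    return SimParams(perturb(nominal.mass), perturb(nominal.drag_coefficient))


def episode_return(gain, params, steps=200, dt=0.01, target_speed=1.0):
    """Toy 1-D vehicle, m * dv/dt = thrust - drag * v, with proportional thrust.
    The return is the negative integrated squared speed error."""
    v, ret = 0.0, 0.0
    for _ in range(steps):
        thrust = gain * (target_speed - v)
        v += dt * (thrust - params.drag_coefficient * v) / params.mass
        error = target_speed - v
        ret -= error * error * dt
    return ret


def randomized_objective(gain, episodes=100, seed=0):
    """Average return over environments sampled anew for every episode."""
    rng = random.Random(seed)
    returns = [episode_return(gain, sample_randomized_params(rng))
               for _ in range(episodes)]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    for gain in (0.5, 5.0):
        print("gain", gain, "average return", randomized_objective(gain))
```

A policy optimizer would maximize `randomized_objective` over the policy parameters. Here the controller is a single gain only to keep the example short.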

Our findings allow us to push an extremely agile drone to its maximum performance, achieving a peak acceleration greater than 12g and a peak velocity of 108 km/h. We show that the RL-based neural network policy outperforms state-of-the-art OC-based methods [8, 17] in terms of robustness and lap time because RL does not rely on a pre-computed trajectory or path. Fig. 1 displays time-lapse illustrations of the racing drone controlled by our RL policy in an indoor flying arena.

B. Policy Search for Model Predictive Control

In [21, 20], we investigate Research Question 2 by presenting a policy-search-for-model-predictive-control framework for merging learning and control. A visualization of the framework is given in Fig 2. We consider model predictive control (MPC) as a parameterized controller and formulate the search for hard-to-optimize decision variables as a probabilistic policy search problem. Given the predicted decision variables, MPC solves an optimization problem and generates control commands for the robot. A key advantage of our approach over the standard MPC formulation is that the high-level decision variables, which are difficult to optimize simultaneously with other state variables, can be learned offline and selected adaptively at runtime.

Figure 2: Graphical model of policy search for MPC.

We validate this framework by focusing on a challenging problem in agile drone flight: flying a quadrotor through fast-moving gates. Flying through fast-moving gates is a proxy task to develop autonomous systems that can navigate the vehicle through rapidly changing environments. Our controller achieved robust and real-time control performance in both simulation and the real world. Additionally, this framework can be used for controller tuning [18].
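The structure of the framework described in this section can be sketched on a toy problem. Everything in the example is an assumption made for illustration and is not taken from [21, 20]: the high-level decision variable is a scalar time horizon $T$, the inner problem (moving a single integrator from 0 to 1 with minimum effort) has the closed-form solution $u = 1/T$ and stands in for the MPC solve, the policy is a context-free Gaussian, and the update is a reward-weighted maximum-likelihood fit, which is one common choice for probabilistic policy search.

```python
import math
import random

DT = 0.05            # integration step [s]
EFFORT_WEIGHT = 4.0  # hypothetical trade-off between time and control effort


def solve_inner_problem_and_rollout(horizon_time):
    """Stand-in for an MPC solve and rollout. For dx/dt = u, x(0) = 0, x(T) = 1,
    the minimum-effort control is the constant u = 1/T.
    Returns the task-level reward -(time + EFFORT_WEIGHT * effort)."""
    steps = max(1, round(horizon_time / DT))  # T is snapped to the time grid
    total_time = steps * DT
    u = 1.0 / total_time
    effort = 0.0
    for _ in range(steps):
        effort += u * u * DT
    return -(total_time + EFFORT_WEIGHT * effort)


def policy_search(mean=5.0, std=1.5, iterations=40, samples=100, beta=3.0, seed=0):
    """Gaussian policy over the high-level variable, updated by
    reward-weighted maximum likelihood."""
    rng = random.Random(seed)
    for _ in range(iterations):
        zs = [rng.gauss(mean, std) for _ in range(samples)]
        rewards = [solve_inner_problem_and_rollout(z) for z in zs]
        best = max(rewards)
        weights = [math.exp(beta * (r - best)) for r in rewards]
        total = sum(weights)
        mean = sum(w * z for w, z in zip(weights, zs)) / total
        var = sum(w * (z - mean) ** 2 for w, z in zip(weights, zs)) / total
        std = max(math.sqrt(var), 0.1)  # floor keeps some exploration
    return mean, std


if __name__ == "__main__":
    learned_mean, learned_std = policy_search()
    print("learned horizon:", learned_mean, "+/-", learned_std)
```

Because the horizon is snapped to the time grid, the reward is piecewise constant in the high-level variable. A sampling-based search handles this without needing gradients through the inner solve.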

C. Policy Learning via Differentiable Simulation

In [24], we investigate Research Question 3 by demonstrating the effectiveness of differentiable simulation for policy training. Differentiable simulation promises faster convergence and more stable training by computing low-variance first-order gradients using the robot model, but so far, its use for robot control has remained limited to simulation [27, 16, 11, 9].
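How first-order gradients are obtained from the model can be illustrated with backpropagation through time on a toy system. The sketch below is not the framework of [24]: the point-mass dynamics, the linear policy $u = -(k_p p + k_v v)$, the cost weights, and the backtracking gradient descent are all assumptions chosen to keep the example self-contained. The forward pass rolls out the policy through the simulator, and the backward pass propagates adjoints through the same dynamics to obtain the exact gradient of the rollout cost with respect to the policy parameters.

```python
def rollout_cost_and_grad(theta, p0=1.0, v0=0.0, steps=50, dt=0.1, w_v=0.1, w_u=0.01):
    """Forward rollout of a point mass under u = -(kp*p + kv*v), followed by a
    manual reverse pass (backpropagation through time)."""
    kp, kv = theta
    ps, vs, us = [p0], [v0], []
    cost = 0.0
    for _ in range(steps):
        p, v = ps[-1], vs[-1]
        u = -(kp * p + kv * v)
        cost += p * p + w_v * v * v + w_u * u * u  # stage cost
        us.append(u)
        ps.append(p + dt * v)                      # explicit Euler step
        vs.append(v + dt * u)
    cost += ps[-1] * ps[-1] + vs[-1] * vs[-1]      # terminal cost

    # Adjoints of the cost with respect to the final state.
    lam_p, lam_v = 2.0 * ps[-1], 2.0 * vs[-1]
    g_kp = g_kv = 0.0
    for k in reversed(range(steps)):
        p, v, u = ps[k], vs[k], us[k]
        d_u = 2.0 * w_u * u + dt * lam_v           # total derivative w.r.t. u_k
        g_kp += d_u * (-p)
        g_kv += d_u * (-v)
        lam_p, lam_v = (2.0 * p + lam_p - kp * d_u,
                        2.0 * w_v * v + dt * lam_p + lam_v - kv * d_u)
    return cost, (g_kp, g_kv)


def train(theta=(0.0, 0.0), iterations=200, step0=1e-2):
    """Gradient descent with backtracking: a step is accepted only if it lowers the cost."""
    cost, grad = rollout_cost_and_grad(theta)
    for _ in range(iterations):
        step = step0
        while step > 1e-12:
            candidate = (theta[0] - step * grad[0], theta[1] - step * grad[1])
            new_cost, new_grad = rollout_cost_and_grad(candidate)
            if new_cost < cost:
                theta, cost, grad = candidate, new_cost, new_grad
                break
            step *= 0.5
    return theta, cost


if __name__ == "__main__":
    print("initial cost:", rollout_cost_and_grad((0.0, 0.0))[0])
    print("trained (theta, cost):", train())
```

In practice an automatic differentiation framework would generate the backward pass. It is written out by hand here to show that the gradient flows through the simulator's state update at every step.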

Figure 3: Graphical model of Differentiable Simulation.

In [24], we tackle the challenge of learning control policies for quadruped locomotion. The main challenge with differentiable simulation lies in the complex optimization landscape of robotic tasks, which arises from discontinuities in contact-rich environments and is particularly pronounced in quadruped locomotion. We propose a new differentiable simulation framework to overcome these challenges. The key idea involves decoupling the complex whole-body simulation, which may exhibit discontinuities due to contact, into two separate continuous domains. Our framework enables learning quadruped walking in minutes using a single simulated robot without any parallelization. When augmented with GPU parallelization, our approach allows the quadruped robot to master diverse locomotion skills, including trot, pace, bound, and gallop, on challenging terrains in minutes. Additionally, our policy transfers zero-shot to the real world, where it achieves robust locomotion performance. To the best of our knowledge, this work represents the first demonstration of using differentiable simulation for controlling a real quadruped robot. This work provides several important insights into using differentiable simulations for legged locomotion in the real world.

D. Future Work

Incorporating structured knowledge from robot dynamics and constraints from optimal control into reinforcement learning could potentially reduce the sample complexity and improve learning efficiency. This could involve using optimal control as a parameterized policy [21] or as an implicit differentiable layer [2, 19], or integrating physical laws and safety constraints directly into the learning process.

In my future research, I plan to develop advanced control frameworks that merge the precision and safety of optimal control with the adaptability and robustness of reinforcement learning. I will also target more challenging robot control tasks, including vision-based humanoid locomotion. My ultimate objective is to achieve a level of locomotion performance comparable to that of a human, navigating through challenging terrains while avoiding obstacles.

References

  • Agarwal et al. [2023] Ananye Agarwal, Ashish Kumar, Jitendra Malik, and Deepak Pathak. Legged locomotion in challenging terrains using egocentric vision. In Conference on Robot Learning, pages 403–415. PMLR, 2023.
  • Amos et al. [2018] Brandon Amos, Ivan Jimenez, Jacob Sacks, Byron Boots, and J. Zico Kolter. Differentiable MPC for End-to-end Planning and Control. 2018.
  • Bryson Jr and Ho [1975] Arthur E Bryson Jr and Yu-Chi Ho. Applied optimal control: optimization, estimation, and control. Hemisphere, 1975.
  • Bertsekas [2012] Dimitri Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, 2012.
  • Chen et al. [2020] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conference on Robot Learning, pages 66–75. PMLR, 2020.
  • Cheng et al. [2023] Xuxin Cheng, Kexin Shi, Ananye Agarwal, and Deepak Pathak. Extreme parkour with legged robots. arXiv preprint arXiv:2309.14341, 2023.
  • de Avila Belbute-Peres et al. [2018] Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J Zico Kolter. End-to-end differentiable physics for learning and control. Advances in neural information processing systems, 31, 2018.
  • Foehn et al. [2021] Philipp Foehn, Angel Romero, and Davide Scaramuzza. Time-optimal planning for quadrotor waypoint flight. Science Robotics, 6(56), 2021. doi: 10.1126/scirobotics.abh1221.
  • Freeman et al. [2021] C Daniel Freeman, Erik Frey, Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. Brax–a differentiable physics engine for large scale rigid body simulation. arXiv preprint arXiv:2106.13281, 2021.
  • Fu et al. [2023] Zipeng Fu, Xuxin Cheng, and Deepak Pathak. Deep whole-body control: learning a unified policy for manipulation and locomotion. In Conference on Robot Learning, pages 138–149. PMLR, 2023.
  • Huang et al. [2021] Zhiao Huang, Yuanming Hu, Tao Du, Siyuan Zhou, Hao Su, Joshua B Tenenbaum, and Chuang Gan. Plasticinelab: A soft-body manipulation benchmark with differentiable physics. arXiv preprint arXiv:2104.03311, 2021.
  • Hwangbo et al. [2019] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
  • Kaufmann et al. [2023] Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Champion-level drone racing using deep reinforcement learning. Nature, 620(7976):982–987, 2023.
  • Lee et al. [2020] Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5(47):eabc5986, 2020.
  • Miki et al. [2022] Takahiro Miki, Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, and Marco Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62):eabk2822, 2022.
  • Ren et al. [2023] Jiawei Ren, Cunjun Yu, Siwei Chen, Xiao Ma, Liang Pan, and Ziwei Liu. Diffmimic: Efficient motion mimicking with differentiable physics. 2023.
  • Romero et al. [2022] Angel Romero, Sihao Sun, Philipp Foehn, and Davide Scaramuzza. Model predictive contouring control for time-optimal quadrotor flight. IEEE Transactions on Robotics, pages 1–17, 2022. doi: 10.1109/TRO.2022.3173711.
  • Romero et al. [2023] Angel Romero, Shreedhar Govil, Gonca Yilmaz, Yunlong Song, and Davide Scaramuzza. Weighted maximum likelihood for controller tuning. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 1334–1341. IEEE, 2023.
  • Romero et al. [2024] Angel Romero, Yunlong Song, and Davide Scaramuzza. Actor-critic model predictive control. In IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024.
  • Song and Scaramuzza [2020] Yunlong Song and Davide Scaramuzza. Learning high-level policies for model predictive control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • Song and Scaramuzza [2022] Yunlong Song and Davide Scaramuzza. Policy search for model predictive control with application to agile drone flight. IEEE Transactions on Robotics, pages 1–17, 2022. doi: 10.1109/TRO.2022.3141602.
  • Song et al. [2021] Yunlong Song, Mats Steinweg, Elia Kaufmann, and Davide Scaramuzza. Autonomous drone racing with deep reinforcement learning. In IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), 2021.
  • Song et al. [2023] Yunlong Song, Angel Romero, Matthias Müller, Vladlen Koltun, and Davide Scaramuzza. Reaching the limit in autonomous racing: Optimal control versus reinforcement learning. Science Robotics, 8(82):eadg1462, 2023.
  • Song et al. [2024] Yunlong Song, Sangbae Kim, and Davide Scaramuzza. Learning quadruped locomotion using differentiable simulation. arXiv preprint arXiv:2403.14864, 2024.
  • Suh et al. [2022] Hyung Ju Suh, Max Simchowitz, Kaiqing Zhang, and Russ Tedrake. Do differentiable simulators give better policy gradients? In International Conference on Machine Learning, pages 20668–20696. PMLR, 2022.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • Xu et al. [2021] Jie Xu, Viktor Makoviychuk, Yashraj Narang, Fabio Ramos, Wojciech Matusik, Animesh Garg, and Miles Macklin. Accelerated policy learning with parallel differentiable simulation. In International Conference on Learning Representations, 2021.
  • Zhuang et al. [2023] Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, and Hang Zhao. Robot parkour learning. arXiv preprint arXiv:2309.05665, 2023.