Abbeel, Quigley, Ng: Using Inaccurate Models in Reinforcement Learning (ICML 2006)


Published on September 25, 2007

Uploaded by: Nickel

Source: authorstream.com

Using Inaccurate Models in Reinforcement Learning
Pieter Abbeel, Morgan Quigley and Andrew Y. Ng, Stanford University

Overview
- Reinforcement learning in high-dimensional continuous state spaces.
- Model-based RL: it is difficult to build an accurate model.
- Model-free RL: it often requires large numbers of real-life trials.
- We present a hybrid algorithm that requires only an approximate model and a small number of real-life trials.
- The resulting policy is (locally) near-optimal.
- Experiments on a flight simulator and a real RC car.

Reinforcement learning formalism
- Markov Decision Process (MDP) M = (S, A, T, H, s0, R).
- S = ℝⁿ (continuous state space).
- Time-varying, deterministic dynamics T = { f_t : S × A → S, t = 0, …, H }.
- Goal: find a policy π : S → A that maximizes U(π) = E[ Σ_{t=0}^{H} R(s_t) | π ].
- Focus: the task of trajectory following.

Motivating Example
- A student driver learning to make a 90-degree right turn: only a few trials are needed, and no accurate model is available.
- The student driver has access to:
  - real-life trials, which show whether the turn came out too wide or too short;
  - a crude model: turning the steering wheel more to the right gives a sharper turn, turning it more to the left gives a wider turn.
- Result: a good estimate of the policy gradient.

Algorithm Idea
- Input to the algorithm: an approximate model.
- Start by computing the optimal policy according to the model.
- Executed in real life, this policy produces a trajectory that misses the target trajectory; yet the policy is optimal according to the model, so no improvement is possible based on the model alone.

Algorithm Idea (2)
- Update the model so that it becomes exact for the current policy.
- The updated model then perfectly predicts the state sequence obtained under the current policy, so we can use the updated model to find an improved policy.

Algorithm
1. Find the (locally) optimal policy for the model.
2. Execute the current policy and record the state trajectory.
3. Update the model so that the new model is exact for the current policy.
4. Use the new model to compute the policy gradient and update the policy parameters: θ := θ + α ∇θ U(θ).
5. Go back to Step 2.
Notes:
- The step-size parameter α is determined by a line search.
- Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used; in our experiments we used differential dynamic programming.
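To make the loop concrete, here is a minimal sketch in Python. It assumes a deterministic one-step model f_model(s, a, t), a policy pi(s, t, theta) parameterized by a numpy vector theta, a reward R(s), and a function run_real_system(theta) returning the real state trajectory; all of these names are hypothetical stand-ins, not the authors' code, and the improvement direction is a plain finite-difference policy gradient rather than the DDP used in the experiments.

```python
import numpy as np

def rollout(f, s0, theta, pi, H):
    """Roll out the policy pi under (possibly corrected) dynamics f for H steps."""
    traj = [s0]
    for t in range(H):
        traj.append(f(traj[-1], pi(traj[-1], t, theta), t))
    return traj

def learn(theta, f_model, pi, R, run_real_system, s0, H, iters=20, eps=1e-4):
    """Iterate: run the real system, make the model exact along the observed
    trajectory, then take a line-searched policy-gradient step on the
    corrected model."""
    utility = lambda traj: sum(R(s) for s in traj)
    for _ in range(iters):
        # Step 2: execute the current policy on the real system, record states.
        real = run_real_system(theta)                    # [s_0, ..., s_H]

        # Step 3: add a time-indexed offset so the model exactly reproduces
        # the observed state sequence under the current policy.
        offset = [real[t + 1] - f_model(real[t], pi(real[t], t, theta), t)
                  for t in range(H)]
        f_corr = lambda s, a, t: f_model(s, a, t) + offset[t]

        # Step 4: policy gradient of the utility under the corrected model
        # (finite differences here; the paper's experiments use DDP instead).
        base = utility(rollout(f_corr, s0, theta, pi, H))
        grad = np.zeros_like(theta)
        for i in range(len(theta)):
            th = theta.copy()
            th[i] += eps
            grad[i] = (utility(rollout(f_corr, s0, th, pi, H)) - base) / eps

        # Step size chosen by a simple line search on the corrected model.
        candidates = [theta] + [theta + (10.0 ** -k) * grad for k in range(6)]
        theta = max(candidates,
                    key=lambda th: utility(rollout(f_corr, s0, th, pi, H)))
    return theta
```

The key step is the time-indexed offset: the corrected model has the same derivatives as the approximate model, but its rollout under the current policy coincides with the real trajectory, which is what makes the local improvement direction meaningful.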
Performance Guarantees: Intuition
- The exact policy gradient differentiates the true dynamics along the true trajectory.
- The model-based policy gradient suffers from two sources of error: the derivatives are evaluated along the wrong trajectory, and they are derivatives of an approximate transition function.
- Our algorithm eliminates one of the two sources of error: because the corrected model reproduces the real trajectory, the derivatives are evaluated along the right trajectory.

Performance Guarantees
- Let the local policy improvement algorithm be the policy gradient.
- Notes: these assumptions are insufficient to give the same performance guarantees for model-based RL. The constant K depends only on the dimensionality of the state, action and policy, the horizon H, and an upper bound on the first and second derivatives of the transition model, the policy and the reward function.

Experiments
- We use differential dynamic programming (DDP) to find control policies in the model.
- Two systems: a flight simulator and an RC car.

Flight Simulator Setup
- The flight simulator model has 43 parameters (mass, inertia, drag coefficients, lift coefficients, etc.).
- We generated "approximate models" by randomly perturbing the parameters.
- All four standard fixed-wing control actions are used: throttle, ailerons, elevators and rudder.
- Our reward function quadratically penalizes deviation from the desired trajectory (a sketch of such a tracking cost follows the Conclusion below).

Flight Simulator Movie and Results
- Comparing the desired trajectory, the model-based controller and our algorithm: 76% utility improvement over the model-based approach.

RC Car Setup
- Control actions: throttle and steering.
- Low-speed dynamics model with state variables: position, velocity, heading and heading rate.
- The model was estimated from 30 minutes of data.

RC Car Maneuvers
- Open-loop turn, circle, and figure-8 maneuver.

Related Work
- Iterative learning control: Uchiyama (1978), Longman et al. (1992), Moore (1993), Horowitz (1993), Bien et al. (1991), Owens et al. (1995), Chen et al. (1997), ...
- Successful robot control with a limited number of trials: Atkeson and Schaal (1997), Morimoto and Doya (2001).
- Robust control theory: Zhou et al. (1995), Dullerud and Paganini (2000), ...; Bagnell et al. (2001), Morimoto and Atkeson (2002), ...

Conclusion
- We presented an algorithm that uses a crude model and a small number of real-life trials to find a policy that works well in real life.
- Our theoretical results show that, assuming a deterministic setting and a reasonable model, the algorithm returns a policy that is (locally) near-optimal.
- Our experiments show that the algorithm can significantly improve on purely model-based RL using only a small number of real-life trials, even when the true system is not deterministic.
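To make the quadratic trajectory-tracking reward used in the experiments concrete, here is a minimal sketch. The weight matrix Q, the desired trajectory, the state dimension and the function names are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def make_tracking_reward(desired_traj, Q):
    """Build a per-time-step reward R(s, t) = -(s - s*_t)^T Q (s - s*_t) that
    quadratically penalizes deviation from the desired state s*_t."""
    def R(s, t):
        err = np.asarray(s) - desired_traj[t]
        return -float(err @ Q @ err)
    return R

# Illustrative usage: a 4-dimensional state over a 100-step horizon, weighting
# position-like components more heavily than rate-like components.
desired_traj = np.zeros((101, 4))
Q = np.diag([10.0, 10.0, 1.0, 1.0])
R = make_tracking_reward(desired_traj, Q)
print(R(np.array([0.5, -0.2, 0.0, 0.1]), t=0))   # -> -2.91
```

Note the time index: the algorithm sketch above uses a time-independent R(s); with a tracking cost like this one, the utility sum would evaluate R(s_t, t) instead.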
