COMPSCI 188 Lecture: CS188 - ALL.pdf

21 Oct 2013
Document Summary

Policy methods: with a fixed policy pi, the expectimax tree is much simpler because there is now only one action per state; V^pi(s) is the expected total discounted reward starting in s and following pi. How do we calculate these values? Either the same way as before, with value-iteration-style updates, or, since the policy is fixed, the values satisfy a system of linear equations that can be solved directly (a sketch of both follows below). Value iteration has its problems: it is really slow, and the policy converges long before the values do.

Offline planning: we know the probabilities of everything, so we can plan the whole course of action before taking the first step. Online planning: we don't know the probabilities, so we have to act in order to discover them -- this is reinforcement learning. Basic idea: receive feedback through rewards. There is still an MDP and we are still looking for a policy; we just don't know the MDP. The twist: we don't know T(s, a, s') or R(s, a, s') (i.e., we don't know which states are good or bad), so some failure is inevitable -- we didn't know an action was bad until we tried it, and that is part of learning.
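A minimal sketch of the two evaluation routes described above, on a small made-up MDP (the transition matrix, rewards, and discount below are illustrative assumptions, not values from the lecture): iterate the Bellman update for the fixed policy, or solve the linear system V = R_pi + gamma * T_pi V directly.

    import numpy as np

    # Hypothetical 3-state MDP under a fixed policy pi:
    # T_pi[s, s'] = T(s, pi(s), s');  R_pi[s] = expected immediate reward from s under pi.
    T_pi = np.array([[0.8, 0.2, 0.0],
                     [0.1, 0.6, 0.3],
                     [0.0, 0.0, 1.0]])   # row-stochastic transition matrix
    R_pi = np.array([0.0, 0.0, 1.0])
    gamma = 0.9

    # Option 1: iterative policy evaluation -- value-iteration-style updates,
    # but with the single action chosen by pi, so there is no max over actions.
    V = np.zeros(3)
    for _ in range(1000):
        V_new = R_pi + gamma * T_pi @ V
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new

    # Option 2: because pi is fixed, V^pi satisfies a linear system:
    #   V = R_pi + gamma * T_pi V   =>   (I - gamma * T_pi) V = R_pi
    V_direct = np.linalg.solve(np.eye(3) - gamma * T_pi, R_pi)

    print(V, V_direct)   # the two answers should agree

Full value iteration would put a max over actions inside the update, which makes the equations nonlinear; removing that max is exactly what lets the fixed-policy values be solved in one linear-algebra step.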
