CSE 150 Lecture Notes - Lecture 18: Markov Decision Processes, Discounting, Policy and Value Iteration

Policy Iteration
Thm: policy iteration converges in a finite # of steps to an optimal policy π*
Pros/cons:
(+) converges “fairly” quickly
(-) each step requires solving linear equations
O(n³) per step for an MDP with n states
Q: What to do if n is “prohibitively” large?
Answer 2: Compute V*(s) directly
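As a concrete illustration of the O(n³) evaluation step, here is a minimal sketch of one policy-iteration step in Python/NumPy, assuming a small made-up MDP stored as a transition tensor P[a, s, s'] = P(s'|s,a) and a reward vector R(s); none of these numbers come from the lecture, they are placeholders only.

import numpy as np

# Hypothetical 3-state, 2-action MDP (illustrative numbers, not from the notes).
n_states, n_actions = 3, 2
gamma = 0.9                                   # discount factor γ
R = np.array([0.0, 1.0, 10.0])                # reward R(s)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s'] = P(s'|s,a)

def policy_iteration_step(pi):
    # Policy evaluation: solve (I - γ P_pi) V = R, the O(n³) linear system.
    P_pi = P[pi, np.arange(n_states), :]      # row s is P(·|s, pi(s))
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    # Policy improvement: Q(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) V(s'), then argmax over a.
    Q = R[:, None] + gamma * np.einsum('asp,p->sa', P, V)
    return np.argmax(Q, axis=1), V

pi = np.zeros(n_states, dtype=int)            # start from an arbitrary policy
while True:                                   # converges in a finite number of steps
    new_pi, V = policy_iteration_step(pi)
    if np.array_equal(new_pi, pi):
        break
    pi = new_pi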
Optimality in MDPs
Thm: there is (at least) one "optimal" policy π* for which Vπ*(s) ≥ Vπ(s) for ALL policies π and ALL states s
Optimal value functions (definitions):
V*(s) := Vπ*(s)
Q*(s,a) := Qπ*(s,a)
Given π*(s), we can compute V*(s) by solving linear equations!
V*(s) = R(s) + γ Σ_{s'} P(s'|s,π*(s)) V*(s')
And we can compute
Q*(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) V*(s')
Given V*(s), can we recover π*(s)? YES
How? Compute Q*(s,a) using the equation above
Then π*(s) = argmax_a Q*(s,a)
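This recovery step is short enough to spell out in code; the sketch below assumes the same hypothetical P[a, s, s'] and R(s) arrays as the earlier policy-iteration sketch.

import numpy as np

def extract_policy(V_star, P, R, gamma):
    # Q*(s,a) = R(s) + γ Σ_{s'} P(s'|s,a) V*(s')
    Q_star = R[:, None] + gamma * np.einsum('asp,p->sa', P, V_star)
    # π*(s) = argmax_a Q*(s,a)
    return np.argmax(Q_star, axis=1)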
Value iteration
How to compute V*(s) “directly”?
V*(s) = max_a Q*(s,a)
      = max_a [R(s) + γ Σ_{s'} P(s'|s,a) V*(s')]
      = R(s) + γ max_a [Σ_{s'} P(s'|s,a) V*(s')]    [Bellman optimality equations]
for s = 1 to n, where n = # of states in the MDP
Bellman Optimality Equations:
n equations (s = 1 to n)
n unknowns (V*(s), s = 1 to n)
These equations are NONLINEAR (because of the max over actions)
How to solve?
Algorithm for value iteration
(1) Initialize V_0(s) = 0 for all s   // array of guessed values, initialized to 0
(2) Iterate for s = 1 to n:
V_{k+1}(s) = R(s) + γ max_a [Σ_{s'=1}^{n} P(s'|s,a) V_k(s')]