CSE 150 Lecture Notes - Lecture 17: Identity Matrix, Asteroid Family

Policy π: S → A maps states into actions
Long-term discounted return: $\sum_{t=0}^{\infty} \gamma^t R(s_t)$
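- Discount factor $0 \le \gamma < 1$
- If $\gamma$ close to 0, weight $\gamma^t$ decays quickly (near-term rewards dominate)
- If $\gamma$ close to 1, weight $\gamma^t$ decays slowly (distant rewards still count)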
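To make the effect of the discount factor concrete, here is a minimal Python sketch. The function name `discounted_return` and the constant reward sequence are illustrative assumptions, not part of the lecture, and the infinite sum is truncated to a finite sequence.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite reward sequence."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical reward sequence: a reward of 1 at each of 50 time steps.
rewards = [1.0] * 50

# Small gamma: almost only the first reward counts.
# Large gamma: many later rewards still contribute.
for gamma in (0.1, 0.5, 0.9):
    print(gamma, discounted_return(rewards, gamma))
```

For a constant reward r at every step, the untruncated sum is the geometric series r / (1 - gamma), so the printed values approach 1 / (1 - gamma) as the sequence gets longer.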
State value function
$V^\pi(s) = E^\pi[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s]$
Thm: There exists at least one optimal policy $\pi^*$ for which $V^{\pi^*}(s) \ge V^\pi(s)$ for ALL states $s$ and ALL policies $\pi$
Planning
Assume complete model of environment as MDP = {S, A, P(s’|s,a), R(s)}, also discount factor $0 \le \gamma < 1$
How do we compute π*(s)?
Start with simpler problems…
1) Policy evaluation - How to compute state value function?
$V^\pi(s) = E^\pi[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s]$
From Bellman equations:
$V^\pi(s) = R(s) + \gamma \sum_{s'=1}^{n} P(s'|s,\pi(s)) V^\pi(s')$ for $s = 1$ to $n$ ($n$ = # states in MDP)
Put ALL unknowns on LHS:
$V^\pi(s) - \gamma \sum_{s'=1}^{n} P(s'|s,\pi(s)) V^\pi(s') = R(s)$
$\sum_{s'=1}^{n} [I(s,s') - \gamma P(s'|s,\pi(s))] V^\pi(s') = R(s)$, where $I(s,s') = 1$ if $s = s'$ and 0 otherwise
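Can rewrite as:
$(I - \gamma P) V = R$, where $I$ = $n \times n$ identity matrix, $P$ = $n \times n$ matrix with entry $P(s'|s,\pi(s))$ in row $s$, column $s'$, and $V$, $R$ = length-$n$ vectors of $V^\pi(s)$ and $R(s)$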
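The matrix P above depends on the policy: row s is the transition distribution for the action the policy picks in state s. A minimal NumPy sketch of that construction follows. The array layout `P[a, s, s2]`, the function name `policy_transition_matrix`, and the numbers are assumptions made for illustration.

```python
import numpy as np

def policy_transition_matrix(P, policy):
    """Return the n x n matrix whose (s, s') entry is P(s' | s, policy[s]).

    Assumed layout: P has shape (num_actions, n, n) with
    P[a, s, s2] = P(s2 | s, a); policy is a length-n sequence of action indices.
    """
    P = np.asarray(P, dtype=float)
    policy = np.asarray(policy)
    n = P.shape[1]
    # For each state s, take row s of the matrix belonging to action policy[s].
    return P[policy, np.arange(n), :]

# Hypothetical MDP with 2 actions and 2 states.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.0, 1.0]],  # action 1
]
policy = [0, 1]  # action 0 in state 0, action 1 in state 1
P_pi = policy_transition_matrix(P, policy)
print(P_pi)
```

Each row of the result is a probability distribution over next states, so every row sums to 1.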
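Ex: states $s \in \{0,1\}$, transitions $p_{s \to s'} = P(s'|s,\pi(s))$, rewards $R(s) = (R(s=0), R(s=1)) = (r_0, r_1)$, state value function $V^\pi(s) = (V^\pi(s=0), V^\pi(s=1)) = (v_0, v_1)$
Solve:
$$\left( \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} - \gamma \begin{pmatrix} p_{0\to 0} & p_{0\to 1} \\ p_{1\to 0} & p_{1\to 1} \end{pmatrix} \right) \begin{pmatrix} v_0 \\ v_1 \end{pmatrix} = \begin{pmatrix} r_0 \\ r_1 \end{pmatrix}$$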
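For this two-state case the system can be solved by hand with the 2x2 inverse formula. The worked solution below is a sketch added for illustration; the symbol $\Delta$ is a name introduced here for the determinant of the matrix on the left-hand side.

```latex
\begin{aligned}
\Delta &= (1-\gamma p_{0\to 0})(1-\gamma p_{1\to 1}) - \gamma^2\, p_{0\to 1}\, p_{1\to 0},\\
v_0 &= \frac{(1-\gamma p_{1\to 1})\, r_0 + \gamma\, p_{0\to 1}\, r_1}{\Delta},\\
v_1 &= \frac{\gamma\, p_{1\to 0}\, r_0 + (1-\gamma p_{0\to 0})\, r_1}{\Delta}.
\end{aligned}
```

As a sanity check, setting the discount factor to 0 gives $\Delta = 1$, $v_0 = r_0$, $v_1 = r_1$: with no weight on the future, each state's value is just its immediate reward.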
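General Solution: $V = (I - \gamma P)^{-1} R$
- Inverse always exists for $0 \le \gamma < 1$ (rows of $P$ sum to 1, so every eigenvalue of $\gamma P$ has magnitude at most $\gamma < 1$ and $I - \gamma P$ has no zero eigenvalue)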
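Policy evaluation by solving the linear system can be sketched in a few lines of NumPy. The function name `evaluate_policy` and the two-state numbers are illustrative assumptions. The sketch calls `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the notes prescribe.

```python
import numpy as np

def evaluate_policy(P_pi, R, gamma):
    """Solve (I - gamma * P_pi) V = R for the state values V.

    P_pi[s, s2] is assumed to hold P(s2 | s, pi(s)); R[s] is the reward in state s.
    """
    P_pi = np.asarray(P_pi, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

# Hypothetical two-state example.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
R = np.array([0.0, 1.0])
gamma = 0.9

V = evaluate_policy(P_pi, R, gamma)
print(V)

# The solution should satisfy the Bellman equation V = R + gamma * P_pi @ V.
print(np.allclose(V, R + gamma * P_pi @ V))
```

The final check substitutes the solution back into the Bellman equation, which is a quick way to confirm the system was set up with the right signs.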
2) Policy Improvement
How to compute new policy $\pi'$ such that $V^{\pi'}(s) \ge V^\pi(s)$ for ALL states $s$?
Define $Q^\pi(s,a)$, the “action value function”: expected return starting from state $s$, taking action $a$, THEN following policy $\pi$ (including if we return TO state $s$!)
$Q^\pi(s,a) = E^\pi[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid s_0 = s, a_0 = a]$
How to compute Qπ(s,a)?
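One way to answer this, which follows from the definition by splitting off the first step, is Q(s,a) = R(s) + (discount factor) × Σ over s’ of P(s’|s,a) V(s’), where V is the state value function of the current policy. The sketch below computes this with NumPy. The array layout `P[a, s, s2]`, the function name `action_values`, and the numbers are assumptions. The final `argmax` line is the standard greedy improvement step, included here as an assumption about where the derivation leads rather than as something stated in the notes.

```python
import numpy as np

def action_values(P, R, V, gamma):
    """Return Q with Q[s, a] = R[s] + gamma * sum_s2 P[a, s, s2] * V[s2].

    Assumed layout: P has shape (num_actions, n, n) with
    P[a, s, s2] = P(s2 | s, a); R and V are length-n vectors.
    """
    P = np.asarray(P, dtype=float)
    R = np.asarray(R, dtype=float)
    V = np.asarray(V, dtype=float)
    # P @ V has shape (num_actions, n); transpose so that rows index states.
    return R[:, None] + gamma * (P @ V).T

# Hypothetical MDP: action 0 stays in place, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # action 0
    [[0.0, 1.0], [1.0, 0.0]],  # action 1
])
R = np.array([0.0, 1.0])
V = np.array([1.0, 2.0])  # stand-in values for the current policy
gamma = 0.5

Q = action_values(P, R, V, gamma)
print(Q)

# Assumed greedy improvement: pick the action with the largest Q in each state.
new_policy = Q.argmax(axis=1)
print(new_policy)
```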