CSE 150 Lecture Notes - Lecture 19: Conditional Independence, Recommender System, Bellman Equation

59 views3 pages
πΣƔπ∞Δ∞
Greedy Policies
π’(s) def argmaxa Qπ(s,a)
Thm: Vπ’(s) >= Vπ(s) for ALL states s
Proof: Vπ(s) = Qπ(s,π(s))
<= maxa Qπ(s, a)
= Qπ(s,π’(s))
= R(s) + ƔΣs’ P(s’|s,π’(s)) Vπ(s’)
So far better to [take one step under π’, then revert to π], than to follow π
One-step inequality:
Vπ(s) <= R(s) + ƔΣs’ P(s’|s,π’(s)) Vπ(s’)
Apply inequality ro RHS:
Vπ(s) <= R(s) + ƔΣs’ P(s’|s,π’(s)) [R(s’) + ƔΣs’’ P(s’’|s’,π’(s’)) Vπ(s’’)]
“Better to [take 2 steps under π’, then revert to π], than ALWAYS following π”
Apply “one-step” inequality to RHS “t” times
Vπ(s) <= R(s) + … Vπ(s’’’’’) ← t+1 primes!
“Better to [take t steps under π’, then revert to π], than ALWAYS following π”
Take t → ∞: Better to follow π’ than π!
This shows Vπ(s) <= Vπ’(s)
Policy Iteration
How to compute an optimal policy π* such that Vπ*(s) >= Vπ(s) for ALL policies and states?
Thm: suppose π’(s) = π(s) for ALL states s (i.e. Vπ(s) = Vπ’(s)) Then Vπ(s) = Vπ*(s) and π(s) is
optimal
Proof:
Part a) From Bellman eqn:
Vπ’(s) <= R(s) + ƔΣs’ P(s’|s,π’(s)) Vπ’(s’)
By assumption Vπ’(s) = Vπ(s) after converging
Vπ(s) = R(s) + ƔΣs’ P(s’|s,π’(s)) Vπ(s’)
By assumption, π’(s) is greedy w/ value function Vπ
Vπ(s’) = R(s) + Ɣ maxa Σs’ P(s’|s,a) Vπ(s’) [Bellman optimality equation]
Iterate this equality into RHS:
Vπ(s’) = R(s) + Ɣ maxa Σs’ P(s’|s,a) [R(s’) + Ɣ maxa’Σs’’ P(s’’|s’,a’) Vπ(s’’)]
Do this again & again…
Imagine the RHS that “results”...
Part b) Let π~(s) be ANY other (non-optimal) policy
From Bellman eqn
Unlock document

This preview shows page 1 of the document.
Unlock all 3 pages and 3 million more documents.

Already have an account? Log in

Get access

Grade+20% off
$8 USD/m$10 USD/m
Billed $96 USD annually
Grade+
Homework Help
Study Guides
Textbook Solutions
Class Notes
Textbook Notes
Booster Class
40 Verified Answers
Class+
$8 USD/m
Billed $96 USD annually
Class+
Homework Help
Study Guides
Textbook Solutions
Class Notes
Textbook Notes
Booster Class
30 Verified Answers

Related Documents