CSE 150 Lecture Notes - Lecture 16: State Space, Indep
Markov Decision Processes (MDPs)
Definition:
-State space S with states s ∈ S
-Action space A with actions a ∈ A
-Transition probabilities for ALL state-action pairs (s,a)
P(s'|s,a) = P(s_{t+1} = s' | s_t = s, a_t = a) ← prob of moving s → s' after taking action a
Assumptions on probs
1. P(s_{t+1}|s_t,a_t) = P(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},...,s_1,a_1) (conditional indep. — the Markov property: the next state depends only on the current state and action)
2. P(s_{t+1} = s'|s_t = s, a_t = a) = P(s_{t+1+k} = s'|s_{t+k} = s, a_{t+k} = a) for ALL integers k (time-independent, i.e. stationary, transition probabilities)
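Both assumptions can be made concrete in a small sketch: a single lookup table P, indexed only by the current (state, action) pair, is enough to sample the next state at every time step. The two-state MDP and its action names here are made up for illustration.

```python
import random

# Hypothetical 2-state MDP with actions 'stay' and 'go'.
# P[(s, a)] is the distribution over next states s'.
# Markov property: sampling s_{t+1} needs only (s_t, a_t), no earlier history.
# Time-independence: the SAME table is consulted at every time step t.
P = {
    (0, 'stay'): {0: 0.9, 1: 0.1},
    (0, 'go'):   {0: 0.2, 1: 0.8},
    (1, 'stay'): {0: 0.1, 1: 0.9},
    (1, 'go'):   {0: 0.7, 1: 0.3},
}

def step(s, a, rng=random):
    """Sample s' ~ P(. | s, a)."""
    dist = P[(s, a)]
    return rng.choices(list(dist), weights=list(dist.values()))[0]
```

Note that `step` takes no time index and no history argument; that signature *is* the two assumptions.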
Reward function
R(s,s’,a) = real-valued reward AFTER taking action a
in state s
and landing in state s’
Simplifications for CSE 150
1.) state AND action spaces are discrete and finite
2.) reward function R(s,s',a) = R(s) = R_s depends ONLY on the current state!
3.) rewards are bounded |R_s| < ∞ for ALL states (and deterministic)
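Under these simplifications, a complete MDP is just a finite table of numbers: a |A|×|S|×|S| transition array and a length-|S| reward vector. A minimal sketch (the particular probabilities and sizes are made up):

```python
# Finite MDP under the course simplifications:
# discrete finite S and A, deterministic bounded reward R(s).
S = 3  # number of states, indexed 0..S-1
A = 2  # number of actions, indexed 0..A-1

# P[a][s][s2] = P(s2 | s, a); each row is a distribution over next states.
P = [
    [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],  # action 0
    [[1.0, 0.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0]],  # action 1
]

# R[s] = R(s): reward depends only on the current state (simplification 2),
# and is a bounded deterministic number (simplification 3).
R = [0.0, 0.0, 1.0]

# Sanity check: every P(.|s,a) sums to 1.
for a in range(A):
    for s in range(S):
        assert abs(sum(P[a][s]) - 1.0) < 1e-9
```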
Example: board-game w/ dice
S = board positions AND rolls of dice
A = set of possible “moves”
P(s'|s,a) = {how the state changes due to the agent's move, the opponent's roll of the dice,
the opponent's move, and the agent's roll of the dice}
R(s) = { +1  if s is a winning state
         -1  if s is a losing state
          0  ALL other states }
Decision-making
-policy: “deterministic” mapping/assignment of states to actions
π: S → A
# policies: |A|^|S| ← exponentially large in state space!
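The count |A|^|S| follows because a deterministic policy independently picks one of |A| actions for each of the |S| states. For a toy MDP the whole policy space can be enumerated (state and action names here are invented):

```python
from itertools import product

# Hypothetical tiny MDP: 3 states, 2 actions.
states  = ['s0', 's1', 's2']
actions = ['left', 'right']

# Each policy pi: S -> A is one assignment of an action to every state.
policies = [dict(zip(states, assignment))
            for assignment in product(actions, repeat=len(states))]

# |A|^|S| = 2^3 = 8 deterministic policies.
assert len(policies) == len(actions) ** len(states)
```

With realistic state spaces this enumeration is hopeless, which is exactly why MDP algorithms avoid searching over policies directly.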
Dynamics under policy π
P(s'|s,π(s)) ← action under π in state s
Experience under policy π
State   s_0 → a_0 = π(s_0) → s_1 → a_1 = π(s_1) → ...
Reward  r_0 = R(s_0)         r_1 = R(s_1)
How to measure accumulated rewards over time?
Discount factor γ with 0 ≤ γ < 1
Long-term discounted return = Σ_{t=0}^∞ γ^t r_t
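Because rewards are bounded and 0 ≤ γ < 1, the infinite sum converges, and truncating it after T steps costs at most on the order of γ^T. A minimal sketch of the discounted return on a (finite prefix of a) reward sequence:

```python
# Discounted return of a reward stream r_0, r_1, ...:
# sum over t of gamma^t * r_t. With bounded rewards and 0 <= gamma < 1
# the infinite series converges; truncation error shrinks like gamma^T.
def discounted_return(rewards, gamma):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Constant reward r_t = 1 approaches the geometric-series limit 1/(1 - gamma).
approx = discounted_return([1.0] * 200, gamma=0.9)
```

For γ = 0.9 the limit 1/(1 − γ) is 10, and 200 terms already land within ~10⁻⁸ of it.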