CS486 Lecture Notes - Lecture 9: Markov Decision Process, Discounting


Document Summary

The Markov decision process (10.29/31.18): a complete Markov decision process represents a sequential decision problem with the following qualities. The states, actions, transition model, and reward function are defined, and all transitions are Markovian (the future is independent of the past given the present). The horizon is either finite (a fixed amount of time left) or infinite (no end time or deadline); a finite horizon makes the problem non-stationary and harder to model. The utility of a sequence of states can be calculated with additive rewards, U(s0, s1, s2, ...) = R(s0) + R(s1) + R(s2) + ..., or with discounted rewards, U(s0, s1, s2, ...) = R(s0) + γR(s1) + γ²R(s2) + ..., where the discount factor γ < 1 reflects the chance that tomorrow may not come. With an infinite sequence of states the total additive reward can be infinite, whereas the total discounted reward is finite. Given U(s), the optimal policy is determined by U(s) = R(s) + γ max_a Σ_{s'} P(s' | s, a) U(s'), choosing in each state the action that maximizes the expected utility of the next state; the first term is the immediate reward of reaching state s. A small sketch of these two ideas follows.
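The Python sketch below is not from the lecture notes; it is a minimal illustration of the two formulas above on a hypothetical two-state MDP (the states, transition probabilities, utilities, and γ = 0.9 are all made-up values). discounted_utility computes U(s0, s1, ...) with a discount factor, and greedy_policy extracts a policy from given utilities U(s) by one-step lookahead.

    GAMMA = 0.9  # assumed discount factor for illustration

    def discounted_utility(rewards, gamma=GAMMA):
        # U(s0, s1, s2, ...) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    def greedy_policy(states, actions, T, U):
        # pi(s) = argmax_a sum_{s'} P(s' | s, a) * U(s')
        policy = {}
        for s in states:
            best_action, best_value = None, float("-inf")
            for a in actions:
                value = sum(p * U[s2] for s2, p in T[(s, a)].items())
                if value > best_value:
                    best_action, best_value = a, value
            policy[s] = best_action
        return policy

    # Hypothetical two-state MDP used only to exercise the functions above.
    states = ["A", "B"]
    actions = ["stay", "move"]
    T = {
        ("A", "stay"): {"A": 1.0},
        ("A", "move"): {"B": 0.8, "A": 0.2},
        ("B", "stay"): {"B": 1.0},
        ("B", "move"): {"A": 0.8, "B": 0.2},
    }
    U = {"A": 1.0, "B": 5.0}  # assumed utilities, e.g. from value iteration

    print(discounted_utility([0, 0, 1]))            # 0.81
    print(greedy_policy(states, actions, T, U))     # {'A': 'move', 'B': 'stay'}

Note that the discount factor cancels out of the argmax, so policy extraction only needs the transition model and the utilities; γ matters when computing U itself.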
