MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states) and, through those, future rewards.
MDPs therefore involve delayed reward and the need to trade off immediate and delayed reward.
Whereas in bandit problems we estimated the value q*(a) of each action a, in MDPs we estimate the value q*(s, a) of each action a in each state s, or the value v*(s) of each state s given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, ...
- At each time step, the agent receives some representation of the environment's state and on that basis selects an action.
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state.
- Markov decision processes formally describe an environment for reinforcement learning
  - where the environment is fully observable,
  - i.e. the current state completely characterises the process.
- Almost all RL problems can be formalised as MDPs
  - e.g. bandits are MDPs with one state.
Markov property
“The future is independent of the past given the present”
A state St is Markov if and only if ℙ[ St+1 | St ] = ℙ[ St+1 | S1, ..., St ]
- A Markov process (or Markov chain) is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property (sketched in code below).
- A Markov process is a tuple ⟨S, P⟩:
  - S is a finite set of states
  - P is a state transition probability matrix, with
    Pss′ = ℙ[ St+1 = s′ | St = s ]
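A minimal sketch of such a Markov chain in code, represented by its transition matrix and sampled step by step; the 3-state chain and its probabilities are made up:

```python
import numpy as np

# Hypothetical 3-state Markov chain; each row of P sums to 1.
states = ["sunny", "rainy", "cloudy"]
P = np.array([
    [0.8, 0.1, 0.1],   # transition probabilities from "sunny"
    [0.3, 0.5, 0.2],   # from "rainy"
    [0.4, 0.3, 0.3],   # from "cloudy"
])

def sample_chain(P, start=0, n_steps=10, seed=0):
    """Sample S1, S2, ...; the next state depends only on the current one (Markov property)."""
    rng = np.random.default_rng(seed)
    s, traj = start, [start]
    for _ in range(n_steps):
        s = rng.choice(len(P), p=P[s])
        traj.append(s)
    return traj

print([states[i] for i in sample_chain(P)])
```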
Markov Reward Process
This is a Markov Process but we also have a reward function! We also have a discount factor.
Markov Reward Process is a tuple ⟨S, P, R, γ⟩
- S is a (finite) set of states
- P is a state transition probability matrix,
Pss′ = ℙ [ St+1=s′ | St=s ]
- R is a reward function,
Rs = 𝔼[ Rt+1 | St = s ]
- γ is a discount factor,
γ ∈ [0, 1]
- The discount γ ∈ [0, 1] is the present value of future rewards
  - the value of receiving reward R after k + 1 time-steps is γ^k R (see the sketch after this list)
  - this values immediate reward above delayed reward
  - γ close to 0 leads to "short-sighted" evaluation
  - γ close to 1 leads to "far-sighted" evaluation
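A tiny sketch of a γ-discounted sum of future rewards (this is the return Gt used later); the reward sequence and γ are made up:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, where rewards[k] holds R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [-2, -2, -2, 10]                    # hypothetical reward sequence
print(discounted_return(rewards, gamma=0.9))  # -2 - 1.8 - 1.62 + 7.29 = 1.87
```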
Most Markov reward and decision processes are discounted. Why?
- mathematical convenience
- avoids infinite returns in cyclic Markov processes
- uncertainty about the future may not be fully represented
- if the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate rewards
- It is sometimes possible to use undiscounted Markov reward processes (γ = 1), e.g. if all sequences terminate
Value function
- The value function v(s) gives the long-term value of (being in) state s
- The state value function v(s) of an MRP is the expected return starting from state s
  v(s) = 𝔼[ Gt | St = s ]
  where the return is Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... = Σk≥0 γ^k Rt+k+1
- The value function is a prediction of the reward obtainable in the next states
- The value function can be decomposed into two parts (since Gt = Rt+1 + γGt+1):
  - the immediate reward Rt+1
  - the discounted value of the successor state, γ v(St+1)
Bellman Equation for MRPs
v(s) = 𝔼[ Rt+1 + γ v(St+1) | St = s ]
- Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way
In the example, the value 4.3 is the reward I get exiting from the state (−2) plus the discount times the value of the next state, etc.
Solving the Bellman Equation
- The Bellman equation is a linear equation: in matrix form, v = R + γPv
  - it can be solved directly: v = (I − γP)⁻¹ R (see the sketch after this list)
  - complexity O(n^3) in the number of states
- there are many iterative methods for large MRPs:
  - dynamic programming
  - Monte-Carlo evaluation
  - temporal-difference learning
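A minimal sketch of the direct solution, solving v = (I − γP)⁻¹ R with numpy; the 3-state MRP below (P, R, γ) is made up:

```python
import numpy as np

# Hypothetical 3-state MRP (the numbers are made up).
P = np.array([
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.5],
])
R = np.array([-2.0, -2.0, 10.0])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Solve (I - γP) v = R; np.linalg.solve is O(n^3) in the number of states.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```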
MDP
Before, transitions were purely random; now we have actions to choose from. But how do we choose?
We have policies: a policy is a distribution over actions given the state, 𝜋(a|s) = ℙ[ At = a | St = s ] (see the sketch after this list)
- a policy fully defines the behaviour of the agent
- MDP policies depend on the current state (not the history)
- policies are stationary (time-independent, depend only on the state but not on the time)
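A minimal sketch of a stationary stochastic policy stored as a table with π(a|s) in row s; the 2-state, 2-action probabilities are made up:

```python
import numpy as np

# pi[s, a] = π(a|s); each row sums to 1 (made-up numbers).
pi = np.array([
    [0.7, 0.3],   # distribution over actions in state 0
    [0.1, 0.9],   # distribution over actions in state 1
])

def select_action(pi, state, rng):
    # Depends only on the current state, not on the history or the time step.
    return rng.choice(pi.shape[1], p=pi[state])

rng = np.random.default_rng(0)
print(select_action(pi, state=0, rng=rng))
```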
Value function
The state-value function v𝜋(s) of an MDP is the expected return starting from state s and then following policy 𝜋:
v𝜋(s) = 𝔼𝜋[ Gt | St = s ]
The action-value function q𝜋(s, a) is the expected return starting from state s, taking action a, and then following policy 𝜋:
q𝜋(s, a) = 𝔼𝜋[ Gt | St = s, At = a ]
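As a sketch of "expected return under 𝜋", v𝜋(s) can be estimated by averaging sampled returns (Monte-Carlo evaluation, mentioned above). The tiny 2-state, 2-action MDP, the policy, γ, and the fixed horizon (used instead of terminal states) are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, horizon = 2, 2, 0.9, 60

# Made-up MDP: P[s, a, s'] = P(s' | s, a), R[s, a] = expected immediate reward.
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.6, 0.4]; P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])
pi = np.full((n_s, n_a), 0.5)      # uniform random policy

def sample_return(s):
    """One Monte-Carlo sample of G_t starting from state s and following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_a, p=pi[s])       # A_t ~ π(·|S_t)
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(n_s, p=P[s, a])     # S_{t+1} ~ P(·|S_t, A_t)
    return g

v_mc = [np.mean([sample_return(s) for _ in range(3000)]) for s in range(n_s)]
print(v_mc)   # Monte-Carlo estimate of v_pi(s) for each state
```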
- The state-value function can again be decomposed into immediate reward plus discounted value of successor state
v𝜋(s) = 𝔼𝜋[ Rt+1 + γ v𝜋(St+1) | St = s ]
- The action-value function can similarly be decomposed
q𝜋(s, a) = 𝔼𝜋[ Rt+1 + γ q𝜋(St+1, At+1) | St = s, At = a ]
Putting it all together (very important, remember it):
v𝜋(s) = Σa 𝜋(a|s) [ Rs^a + γ Σs′ Pss′^a v𝜋(s′) ]
q𝜋(s, a) = Rs^a + γ Σs′ Pss′^a Σa′ 𝜋(a′|s′) q𝜋(s′, a′)
As we can see, an action does not necessarily bring the agent to a specific state: the successor state is drawn from Pss′^a.
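A minimal sketch of using the combined equation as an iterative backup (policy evaluation); the MDP, policy and γ are the same made-up ones as in the Monte-Carlo sketch above:

```python
import numpy as np

# Made-up MDP: P[s, a, s'] = P(s' | s, a), R[s, a] = R_s^a, uniform random policy.
n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.6, 0.4]; P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])
pi = np.full((n_s, n_a), 0.5)

v = np.zeros(n_s)
for _ in range(500):
    q = R + gamma * (P @ v)        # q[s, a] = R_s^a + γ Σ_s' P_ss'^a v(s')
    v_new = (pi * q).sum(axis=1)   # v[s]   = Σ_a π(a|s) q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)   # should roughly match the Monte-Carlo estimate above (up to sampling/truncation error)
```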
Example: Gridworld
Two actions lead back to the same state, two actions go to a different state.
The value of C is low because from C I get a reward of 0 everywhere I go.