MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states) and, through those, future rewards.

MDPs involve delayed rewards and the need to trade off immediate and delayed rewards.

Whereas in bandits we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or we estimate the value $v_*(s)$ of each state $s$ given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, $t = 0, 1, 2, 3, \dots$
- At each time step, the agent receives some representation of the environment's state and on that basis selects an action.
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state (a minimal loop sketch follows this list).
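A minimal sketch of this interaction loop, with a made-up two-state environment and a random action choice standing in for a real agent (all names, dynamics, and rewards here are illustrative, not from any particular library):

```python
import random

# A tiny, made-up two-state environment, just to illustrate the loop:
# at each time step the agent picks an action, then receives a reward and a new state.
def env_step(state, action):
    next_state = state if action == "stay" else 1 - state   # hypothetical dynamics
    reward = 1.0 if next_state == 1 else 0.0                 # hypothetical reward
    return next_state, reward

state = 0                                     # S_0: initial state
for t in range(5):
    action = random.choice(["stay", "move"])  # A_t: selected on the basis of S_t
    state, reward = env_step(state, action)   # one step later: R_{t+1} and S_{t+1}
    print(f"t={t} action={action} reward={reward} next_state={state}")
```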
- Markov decision processes formally describe an environment for reinforcement learning
    - where the environment is fully observable
    - i.e. the current state completely characterises the process
- Almost all RL problems can be formalised as MDPs
    - e.g. bandits are MDPs with one state
#### Markov Property

"The future is independent of the past given the present"

![[Pasted image 20241030102226.png]]![[Pasted image 20241030102243.png]]
- A Markov process (or Markov chain) is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.
- S is a finite set of states
- P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$ (a sampling sketch follows this list)
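A minimal sketch of such a chain, assuming a made-up three-state transition matrix; note that sampling the next state uses only the current state, which is exactly the Markov property:

```python
import random

# Illustrative 3-state Markov chain; each row of P sums to 1.
P = {
    "A": {"A": 0.1, "B": 0.6, "C": 0.3},
    "B": {"A": 0.4, "B": 0.4, "C": 0.2},
    "C": {"A": 0.5, "B": 0.0, "C": 0.5},
}

def sample_chain(start, n_steps):
    """Sample S_1, S_2, ... using only the current state (Markov property)."""
    s, path = start, [start]
    for _ in range(n_steps):
        next_states = list(P[s])
        probs = [P[s][s2] for s2 in next_states]
        s = random.choices(next_states, weights=probs)[0]
        path.append(s)
    return path

print(sample_chain("A", 10))
```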
Example

![[Pasted image 20241030102420.png]]

![[Pasted image 20241030102722.png]]
#### Markov Reward Process

This is a Markov process, but we also have a reward function and a discount factor.

A Markov Reward Process is a tuple ⟨S, P, R, γ⟩:

- S is a (finite) set of states
- P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- R is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- γ is a discount factor, $γ ∈ [0, 1]$
![[Pasted image 20241030103041.png]]

![[Pasted image 20241030103114.png]]
- The discount $γ ∈ [0, 1]$ is the present value of future rewards
- The value of receiving reward R after k + 1 time-steps is $γ^k R$ (see the return computation sketched after this list)
- This values immediate reward above delayed reward:
    - γ close to 0 leads to "short-sighted" evaluation
    - γ close to 1 leads to "far-sighted" evaluation
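A small sketch of how a discounted return is computed from a sampled reward sequence (the rewards and γ values are made up for illustration):

```python
# Discounted return G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [-2, -2, -2, 10]               # an illustrative reward sequence

print(discounted_return(rewards, 0.1))   # "short-sighted": later rewards barely count
print(discounted_return(rewards, 0.9))   # "far-sighted": the final +10 matters a lot
print(discounted_return(rewards, 1.0))   # undiscounted: fine here, the sequence terminates
```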
Most Markov reward and decision processes are discounted. Why?

- mathematical convenience
- avoids infinite returns in cyclic Markov processes
- uncertainty about the future may not be fully represented
- if the reward is financial, immediate rewards may earn more interest than delayed rewards
- animal/human behaviour shows a preference for immediate rewards
- it is sometimes possible to use undiscounted Markov reward processes (γ = 1), e.g. if all sequences terminate
#### Value Function

- The value function v(s) gives the long-term value of (being in) state s
- The state value function v(s) of an MRP is the expected return starting from state s: $v(s) = \mathbb{E}[G_t \mid S_t = s]$ (a Monte-Carlo estimate is sketched below)
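Because v(s) is an expectation of returns, it can be approximated by averaging sampled returns from state s. A sketch with a made-up two-state MRP (transition probabilities, rewards, and γ are illustrative):

```python
import random

# Illustrative 2-state MRP: transition probabilities, rewards, and γ are made up.
P = {"s0": {"s0": 0.5, "s1": 0.5}, "s1": {"s0": 0.2, "s1": 0.8}}
R = {"s0": 1.0, "s1": -1.0}        # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

def sample_return(s, horizon=200):
    """One sampled return G_t starting from state s (horizon-truncated)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * R[s]       # reward received on leaving state s
        discount *= gamma
        s = random.choices(list(P[s]), weights=list(P[s].values()))[0]
    return g

def v_estimate(s, n_episodes=5000):
    """Monte-Carlo estimate of v(s) = E[G_t | S_t = s]."""
    return sum(sample_return(s) for _ in range(n_episodes)) / n_episodes

print(v_estimate("s0"), v_estimate("s1"))
```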
![[Pasted image 20241030103519.png]]

![[Pasted image 20241030103706.png]]

The value function is a prediction of the rewards to be collected in the next states.
![[Pasted image 20241030103739.png]]

![[Pasted image 20241030103753.png]]

- The value function can be decomposed into two parts:
    - immediate reward $R_{t+1}$
    - discounted value of the successor state $γ\, v(S_{t+1})$

![[Pasted image 20241030103902.png]]
#### Bellman Equation for MRPs

$$v(s) = \mathbb{E}[R_{t+1} + γ\, v(S_{t+1}) \mid S_t = s]$$

![[Pasted image 20241030104056.png]]

- The Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way (the expanded form is written out below)
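Written out with the transition matrix P and reward function R defined earlier, the same equation reads (standard form, in the notation used above):

$$v(s) = R_s + γ \sum_{s' \in S} P_{ss'}\, v(s')$$

and, stacking the values of all states into a vector, in matrix form:

$$v = R + γ P v$$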
![[Pasted image 20241030104229.png]]

In the example, 4.3 is the reward I get exiting from the state (-2), plus the discount times the value of each possible next state, weighted by its transition probability, etc.

![[Pasted image 20241030104451.png]]
#### Solving the Bellman Equation

- The Bellman equation is a linear equation
    - it can be solved directly (see the sketch after this list): ![[Pasted image 20241030104537.png]]
    - complexity: O(n³) for n states
- There are many iterative methods for large MRPs:
    - dynamic programming
    - Monte-Carlo evaluation
    - temporal-difference learning
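Rearranging the matrix form $v = R + γPv$ from above gives the direct solution $v = (I - γP)^{-1}R$. A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

# Illustrative 3-state MRP (numbers made up); each row of P sums to 1.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.5, 0.0, 0.5]])
R = np.array([-1.0, -2.0, 10.0])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Direct solution of v = R + γ P v, i.e. (I - γP) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)                           # one value per state
```

For large state spaces the cubic-cost solve becomes impractical, which is where the iterative methods listed above come in.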
#### MDP

![[Pasted image 20241030104632.png]]

![[Pasted image 20241030104722.png]]
Before, we only had transition probabilities; now we also have actions to choose from. But how do we choose?

We have policies: a policy is a distribution over actions given the state: $$π(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

- A policy fully defines the behaviour of the agent
- MDP policies depend on the current state (not the history)
- Policies are stationary (time-independent): they depend only on the state, not on the time (a small sampling sketch follows this list)
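A minimal sketch of a stochastic policy as a per-state distribution over actions (states, actions, and probabilities are made up for illustration):

```python
import random

# π(a|s): for each state, a probability distribution over actions.
policy = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def act(state):
    """Sample A_t ~ π(· | S_t): depends only on the current state (stationary)."""
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return random.choices(actions, weights=probs)[0]

print(act("s0"), act("s1"))
```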
##### Value function

The state-value function $v_π(s)$ of an MDP is the expected return starting from state s and then following policy π: $$v_π(s) = \mathbb{E}_π[G_t \mid S_t = s]$$

The action-value function $q_π(s, a)$ is the expected return starting from state s, taking action a, and then following policy π: $$q_π(s, a) = \mathbb{E}_π[G_t \mid S_t = s, A_t = a]$$

![[Pasted image 20241030105022.png]]
- The state-value function can again be decomposed into immediate reward plus the discounted value of the successor state: $$v_π(s) = \mathbb{E}_π[R_{t+1} + γ\, v_π(S_{t+1}) \mid S_t = s]$$
- The action-value function can similarly be decomposed: $$q_π(s, a) = \mathbb{E}_π[R_{t+1} + γ\, q_π(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

![[Pasted image 20241030105148.png]]![[Pasted image 20241030105207.png]]

![[Pasted image 20241030105216.png]]

Putting it all together (very important, remember it):
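For reference, combining the two decompositions via a one-step lookahead over actions and transitions gives the standard Bellman expectation equations (written here assuming the usual notation $R^a_s$ and $P^a_{ss'}$ for the MDP's action-conditioned rewards and transition probabilities; the diagrams below illustrate them):

$$v_π(s) = \sum_{a \in A} π(a \mid s)\, q_π(s, a)$$

$$q_π(s, a) = R^a_s + γ \sum_{s' \in S} P^a_{ss'}\, v_π(s')$$

$$v_π(s) = \sum_{a \in A} π(a \mid s) \left( R^a_s + γ \sum_{s' \in S} P^a_{ss'}\, v_π(s') \right)$$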
![[Pasted image 20241030105234.png]]

![[Pasted image 20241030105248.png]]

As we can see, taking an action does not necessarily lead to one specific state: the environment may transition to different states with different probabilities.
Example: Gridworld

Two actions lead back to the same state, and two actions go to a different state.

![[Pasted image 20241030112555.png]]

![[Pasted image 20241030113041.png]]

![[Pasted image 20241030113135.png]]

C is low because from C I get a reward of 0 wherever I go.