MDPs are a classical formalization of sequential decision making, where actions influence not only immediate rewards but also subsequent situations (states) and, through those, future rewards.

MDPs involve delayed rewards and the need to trade off immediate and delayed rewards.

Whereas in bandits we estimated the value q*(a) of each action a, in MDPs we estimate the value q*(s, a) of each action a in each state s, or we estimate the value v*(s) of each state s given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, ...
- At each time step, the agent receives some representation of the environment's state and on that basis selects an action.
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state.
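As a sketch of this interaction loop, here is a minimal Python example with a made-up two-state environment and a random action choice (the environment, its dynamics, and its rewards are illustrative assumptions, not part of the original notes):

```python
import random

# Toy environment: two states; action 1 switches state, action 0 stays.
class ToyEnv:
    def reset(self):
        self.state = 0
        return self.state               # the agent receives the initial state S_0

    def step(self, action):
        self.state = self.state if action == 0 else 1 - self.state
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward       # one step later: S_{t+1} and R_{t+1}

env = ToyEnv()
state = env.reset()
for t in range(5):
    action = random.choice([0, 1])      # select A_t on the basis of S_t (random here)
    next_state, reward = env.step(action)
    print(t, state, action, reward, next_state)
    state = next_state                  # the loop continues from the new state
```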
- Markov decision processes formally describe an environment for reinforcement learning
- where the environment is fully observable,
- i.e. the current state completely characterises the process.
- Almost all RL problems can be formalised as MDPs,
- e.g. bandits are MDPs with one state.
#### Markov Property

“The future is independent of the past given the present”

![[Pasted image 20241030102226.png]]![[Pasted image 20241030102243.png]]

- A Markov process (or Markov chain) is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.
- S is a (finite) set of states
- P is a state transition probability matrix,
- then $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
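A minimal sketch of such a process in Python, with a made-up two-state chain (the states and probabilities are illustrative assumptions, not the example from the slides):

```python
import numpy as np

states = ["A", "B"]
P = np.array([[0.9, 0.1],        # P[s, s'] = P(S_{t+1} = s' | S_t = s)
              [0.5, 0.5]])       # each row sums to 1

rng = np.random.default_rng(0)
s = 0                            # start in state A
trajectory = [states[s]]
for _ in range(10):
    s = rng.choice(len(states), p=P[s])   # the next state depends only on the current one
    trajectory.append(states[s])
print(trajectory)
```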
Example

![[Pasted image 20241030102420.png]]

![[Pasted image 20241030102722.png]]
#### Markov Reward Process

A Markov reward process is a Markov process that additionally has a reward function and a discount factor.

A Markov Reward Process is a tuple ⟨S, P, R, γ⟩:

- S is a (finite) set of states
- P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- R is a reward function, $R_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- γ is a discount factor, $\gamma \in [0, 1]$
![[Pasted image 20241030103041.png]]

![[Pasted image 20241030103114.png]]

- The discount $\gamma \in [0, 1]$ is the present value of future rewards
- The value of receiving reward R after k + 1 time-steps is $\gamma^k R$
- This values immediate reward above delayed reward
- γ close to 0 leads to “short-sighted” evaluation
- γ close to 1 leads to “far-sighted” evaluation
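A quick sketch of how γ shapes the return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$, using a made-up reward sequence (the numbers are illustrative assumptions):

```python
rewards = [1.0, 0.0, 0.0, 0.0, 10.0]          # hypothetical R_{t+1}, R_{t+2}, ...

def discounted_return(rewards, gamma):
    # G_t = sum_k gamma^k * R_{t+k+1}
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.1))   # ≈ 1.0: "short-sighted", the late reward barely counts
print(discounted_return(rewards, gamma=0.99))  # ≈ 10.6: "far-sighted", the late reward dominates
```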
Most Markov reward and decision processes are discounted. Why?

- Mathematical convenience
- Avoids infinite returns in cyclic Markov processes
- Uncertainty about the future may not be fully represented
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows a preference for immediate rewards
- It is sometimes possible to use undiscounted Markov reward processes (γ = 1), e.g. if all sequences terminate
##### Value function

- The value function v(s) gives the long-term value of (being in) state s
- The state value function v(s) of an MRP is the expected return starting from state s: $v(s) = \mathbb{E}[G_t \mid S_t = s]$
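To make “expected return starting from s” concrete, here is a sketch that estimates v(s) by averaging sampled returns from a made-up two-state MRP (all numbers are illustrative assumptions):

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.0, 1.0]])               # state 1 is absorbing
R = np.array([1.0, 0.0])                 # R_s: expected reward for leaving state s
gamma = 0.9
rng = np.random.default_rng(0)

def sample_return(s, steps=100):
    g, discount = 0.0, 1.0
    for _ in range(steps):               # truncate: gamma^100 is negligible
        g += discount * R[s]
        discount *= gamma
        s = rng.choice(2, p=P[s])
    return g

returns = [sample_return(0) for _ in range(2000)]
print(np.mean(returns))                  # Monte-Carlo estimate of v(0), ≈ 1/(1 - 0.45) ≈ 1.82
```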
![[Pasted image 20241030103519.png]]

![[Pasted image 20241030103706.png]]

The value function is a prediction of the reward obtainable from the next states onwards.

![[Pasted image 20241030103739.png]]

![[Pasted image 20241030103753.png]]
- The value function can be decomposed into two parts:
    - immediate reward $R_{t+1}$
    - discounted value of successor state $\gamma v(S_{t+1})$

![[Pasted image 20241030103902.png]]
#### Bellman Equation for MRPs

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$

![[Pasted image 20241030104056.png]]

- The Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way
![[Pasted image 20241030104229.png]]

The value 4.3 is the reward received on exiting the state (−2), plus the discount times the value of each possible next state, weighted by its transition probability, and so on.

![[Pasted image 20241030104451.png]]
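Written out per state (a standard component form of the equation above, which I take to be what the figure illustrates), that reading is:

$$v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} \, v(s')$$

i.e. the reward expected on leaving s, plus the discounted values of the successor states weighted by their transition probabilities.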
#### Solving the Bellman Equation

- The Bellman equation is a linear equation
- It can be solved directly (see the sketch after this list): ![[Pasted image 20241030104537.png]]
- Computational complexity is O(n^3) for n states
- There are many iterative methods:
    - dynamic programming
    - Monte-Carlo evaluation
    - temporal-difference learning
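A minimal sketch of the direct solution for a small, made-up 3-state MRP (P, R and γ below are illustrative assumptions); in matrix form the Bellman equation v = R + γPv gives v = (I − γP)⁻¹R:

```python
import numpy as np

P = np.array([[0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9],
              [0.0, 0.0, 1.0]])          # row-stochastic transition matrix
R = np.array([-1.0, -2.0, 0.0])          # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Solve (I - gamma * P) v = R; like forming the inverse, this costs O(n^3) in the
# number of states, which is why iterative methods matter for large problems.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)

# Sanity check: v satisfies the Bellman equation v = R + gamma * P v
assert np.allclose(v, R + gamma * P @ v)
```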
#### MDP

![[Pasted image 20241030104632.png]]

![[Pasted image 20241030104722.png]]

Before, we only had transition probabilities; now we also have actions to choose from. But how do we choose?
We have policies: a distribution over actions given the states: $$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

- A policy fully defines the behaviour of the agent
- MDP policies depend on the current state (not the history)
- Policies are stationary (time-independent): they depend only on the state, not on the time
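As a small sketch, a stochastic policy for a made-up MDP with two states and two actions, and how an action would be sampled from it (names and probabilities are illustrative assumptions):

```python
import numpy as np

policy = {
    "s1": {"left": 0.7, "right": 0.3},   # pi(a | s1)
    "s2": {"left": 0.1, "right": 0.9},   # pi(a | s2)
}

rng = np.random.default_rng(0)

def sample_action(policy, state):
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choice(actions, p=probs)  # A_t ~ pi(. | S_t), the same at every time step

print(sample_action(policy, "s1"))
```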
##### Value function

The state-value function v𝜋(s) of an MDP is the expected return starting from state s, and then following policy 𝜋: $$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

The action-value function q𝜋(s, a) is the expected return starting from state s, taking action a, and then following policy 𝜋: $$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

![[Pasted image 20241030105022.png]]
- The state-value function can again be decomposed into immediate reward plus the discounted value of the successor state: $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
- The action-value function can similarly be decomposed: $$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

![[Pasted image 20241030105148.png]]![[Pasted image 20241030105207.png]]

![[Pasted image 20241030105216.png]]
Putting it all together (very important, remember it):

![[Pasted image 20241030105234.png]]

![[Pasted image 20241030105248.png]]

As we can see, an action does not necessarily lead to a specific state: the environment may end up in different successor states with different probabilities.
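A minimal sketch of iterative policy evaluation based on the Bellman expectation equations above, for a made-up 2-state, 2-action MDP under a uniform random policy (the dynamics, rewards, and policy are illustrative assumptions, not the gridworld from the slides):

```python
import numpy as np

n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))  # P[s, a, s'] = P(s' | s, a)
P[0, 0] = [1.0, 0.0]                           # in state 0, action 0 stays put
P[0, 1] = [0.2, 0.8]                           # action 1 usually moves to state 1
P[1, 0] = [0.9, 0.1]
P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, -1.0],                     # R[s, a] = E[R_{t+1} | S_t = s, A_t = a]
              [5.0,  1.0]])
pi = np.full((n_states, n_actions), 0.5)       # uniform random policy pi(a | s)
gamma = 0.9

v = np.zeros(n_states)
for _ in range(200):                           # repeated Bellman expectation backups
    q = R + gamma * (P @ v)                    # q(s, a) = R(s,a) + gamma * sum_s' P(s'|s,a) v(s')
    v = (pi * q).sum(axis=1)                   # v(s)    = sum_a pi(a|s) q(s, a)
print(v)
```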
Example: Gridworld

Two actions lead back to the same state, and two actions go to a different state.

![[Pasted image 20241030112555.png]]

![[Pasted image 20241030113041.png]]

![[Pasted image 20241030113135.png]]

The value of C is low because from C we get a reward of 0 wherever we go.