MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states) and, through those, future rewards.
MDPs therefore involve delayed reward and the need to trade off immediate and delayed reward.
Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or the value $v_*(s)$ of each state $s$ given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .
- At each time step, the agent receives some representation of the environment's state and on that basis selects an action
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state (this loop is sketched in code after the list)
- Markov decision processes formally describe an environment for reinforcement learning
- Where the environment is fully observable
- i.e. The current state completely characterises the process
- Almost all RL problems can be formalised as MDPs
- e.g. Bandits are MDPs with one state
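A minimal sketch of this interaction loop, assuming a hypothetical environment object with `reset()` and `step()` methods and an agent with an `act(state)` method (all illustrative names, not from these notes):

```python
def run_episode(env, agent):
    """Run one episode of the agent-environment loop at discrete time steps t = 0, 1, 2, ..."""
    state = env.reset()                          # the agent receives a representation of the state
    total_return, done = 0.0, False
    while not done:
        action = agent.act(state)                # select an action on the basis of that state
        state, reward, done = env.step(action)   # one step later: numerical reward and new state
        total_return += reward
    return total_return
```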
#### Markov property
“The future is independent of the past given the present”
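Formally, a state $S_t$ is Markov if and only if $$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$$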
- The value function can be decomposed into two parts:
- immediate reward $R_{t+1}$
- discounted value of the successor state $\gamma v(S_{t+1})$
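This decomposition comes from the recursive structure of the return: $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1}$$ Taking the expectation conditioned on $S_t = s$ gives the Bellman equation below.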
![[Pasted image 20241030103902.png]]
#### Bellman Equation for MRPs
$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$
![[Pasted image 20241030104056.png]]
- The Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way
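Written out over successor states (using the standard MRP notation, where $\mathcal{R}_s$ is the expected immediate reward and $\mathcal{P}_{ss'}$ the transition probability; these symbols are not defined elsewhere in these notes): $$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$$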
![[Pasted image 20241030104229.png]]
4.3 is the reward received on exiting the state (−2) plus the discount factor times the values of the successor states, each weighted by its transition probability, etc.
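If the figure is the undiscounted student-MRP check ($\gamma = 1$, successor values 10 and 0.8 reached with probabilities 0.6 and 0.4, as the numbers suggest), the arithmetic is: $$4.3 \approx -2 + 1.0 \cdot (0.6 \cdot 10 + 0.4 \cdot 0.8) = 4.32$$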
![[Pasted image 20241030104451.png]]
#### Solving the Bellman Equation
- The Bellman equation is a linear equation
- can be solved directly (see the code sketch after this list):
	![[Pasted image 20241030104537.png]]
- complexity $O(n^3)$ for $n$ states
- many iterative methods
- dynamic programming
- Monte-Carlo evaluation
- temporal-difference learning
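A minimal sketch of the direct solution $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$, assuming a small MRP given as a transition matrix `P`, an expected-reward vector `R`, and a discount factor `gamma` (all values below are illustrative, not taken from these notes):

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Solve the linear Bellman equation v = R + gamma * P v directly."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)   # O(n^3) matrix solve

# Tiny illustrative 3-state MRP (made-up numbers).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])    # state 2 is absorbing
R = np.array([-1.0, -2.0, 0.0])    # expected immediate reward per state
print(solve_mrp(P, R, gamma=0.9))
```

Because of the cubic cost, the direct solve is only practical for small MRPs; the iterative methods listed above are used for larger problems.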
#### MDP
![[Pasted image 20241030104632.png]]
![[Pasted image 20241030104722.png]]
Before, state transitions were governed purely by fixed probabilities; now we also have actions to choose from. But how do we choose?
We use a policy: a distribution over actions given the state: $$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$
- a policy fully defines the behavior of the agent
- MDP policies depend on the current state (not the history)
- policies are stationary (time-independent, depend only on the state but not on the time)
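A minimal sketch of how such a stationary stochastic policy can be represented and sampled; the numbers of states/actions and the probabilities are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# pi[s, a] = P[A_t = a | S_t = s]; each row sums to 1 (3 states, 2 actions, made-up numbers).
pi = np.array([[0.7, 0.3],
               [0.1, 0.9],
               [0.5, 0.5]])

def act(pi, state):
    """Sample an action from pi(. | state); depends only on the current state, not on the time."""
    return rng.choice(pi.shape[1], p=pi[state])

action = act(pi, state=1)   # returns 0 or 1, with probabilities 0.1 and 0.9
```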
##### Value function
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$: $$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
- The state-value function can again be decomposed into immediate reward plus discounted value of the successor state: $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
- The action-value function can similarly be decomposed: $$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
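A minimal sketch of iterative policy evaluation, which repeatedly applies the decomposition of $v_\pi$ above as an update rule; the arrays `P` (with `P[s, a, s']` the transition probability), `R` (with `R[s, a]` the expected reward) and the policy matrix `pi` are assumed inputs, not objects defined in these notes:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s,a,s') * v(s'))."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * (P @ v)                  # q[s, a]: immediate reward + discounted successor value
        v_new = np.einsum('sa,sa->s', pi, q)     # average over actions according to the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q                      # v_pi and the corresponding q_pi
        v = v_new
```

At convergence, `q` equals the action-value $q_\pi$, since averaging $q_\pi(S_{t+1}, A_{t+1})$ over $A_{t+1} \sim \pi$ gives $v_\pi(S_{t+1})$.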