MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states) and, through those, future rewards.
MDPs therefore involve delayed reward and the need to trade off immediate and delayed reward.
Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$, or the value $v_*(s)$ of each state $s$ given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .
- At each time step, the agent receives some representation of the environment's state and on that basis selects an action
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state (this loop is sketched in code after the list)
- Markov decision processes formally describe an environment for reinforcement learning
- Where the environment is fully observable
- i.e. The current state completely characterises the process
- Almost all RL problems can be formalised as MDPs
- e.g. Bandits are MDPs with one state
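A minimal sketch of this interaction loop, assuming a hypothetical environment object with `reset()` and `step()` methods and an agent with an `act(state)` method (all illustrative names, not from these notes):

```python
def run_episode(env, agent):
    """Run one episode of the agent-environment loop at discrete time steps t = 0, 1, 2, ..."""
    state = env.reset()                          # the agent receives a representation of the state
    total_return, done = 0.0, False
    while not done:
        action = agent.act(state)                # select an action on the basis of that state
        state, reward, done = env.step(action)   # one step later: numerical reward and new state
        total_return += reward
    return total_return
```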
#### Markov property
“The future is independent of the past given the present”
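Formally, a state $S_t$ is Markov if and only if $$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$$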
- The value function can be decomposed into two parts:
- immediate reward $R_{t+1}$
- discounted value of the successor state $\gamma v(S_{t+1})$
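This decomposition comes from the recursive structure of the return: $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma G_{t+1}$$ Taking the expectation conditioned on $S_t = s$ gives the Bellman equation below.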
![[Pasted image 20241030103902.png]]
#### Bellman Equation for MRPs
$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]$$
![[Pasted image 20241030104056.png]]
- The Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way
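Written out over successor states (using the standard MRP notation, where $\mathcal{R}_s$ is the expected immediate reward and $\mathcal{P}_{ss'}$ the transition probability; these symbols are not defined elsewhere in these notes): $$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$$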
![[Pasted image 20241030104229.png]]
4.3 is the reward received on exiting the state (−2) plus the discount factor times the values of the successor states, each weighted by its transition probability, etc.
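If the figure is the undiscounted student-MRP check ($\gamma = 1$, successor values 10 and 0.8 reached with probabilities 0.6 and 0.4, as the numbers suggest), the arithmetic is: $$4.3 \approx -2 + 1.0 \cdot (0.6 \cdot 10 + 0.4 \cdot 0.8) = 4.32$$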
![[Pasted image 20241030104451.png]]
#### Solving the Bellman Equation
- The Bellman equation is a linear equation
- can be solved directly (see the code sketch after this list):
	![[Pasted image 20241030104537.png]]
- complexity $O(n^3)$ for $n$ states
- many iterative methods
- dynamic programming
- Monte-Carlo evaluation
- temporal-difference learning
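A minimal sketch of the direct solution $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$, assuming a small MRP given as a transition matrix `P`, an expected-reward vector `R`, and a discount factor `gamma` (all values below are illustrative, not taken from these notes):

```python
import numpy as np

def solve_mrp(P, R, gamma):
    """Solve the linear Bellman equation v = R + gamma * P v directly."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)   # O(n^3) matrix solve

# Tiny illustrative 3-state MRP (made-up numbers).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])    # state 2 is absorbing
R = np.array([-1.0, -2.0, 0.0])    # expected immediate reward per state
print(solve_mrp(P, R, gamma=0.9))
```

Because of the cubic cost, the direct solve is only practical for small MRPs; the iterative methods listed above are used for larger problems.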
#### MDP
![[Pasted image 20241030104632.png]]
![[Pasted image 20241030104722.png]]
Before, state transitions were governed purely by fixed probabilities; now we also have actions to choose from. But how do we choose?
We use a policy: a distribution over actions given the state: $$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$
- a policy fully defines the behavior of the agent
- MDP policies depend on the current state (not the history)
- policies are stationary (time-independent, depend only on the state but not on the time)
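A minimal sketch of how such a stationary stochastic policy can be represented and sampled; the numbers of states/actions and the probabilities are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# pi[s, a] = P[A_t = a | S_t = s]; each row sums to 1 (3 states, 2 actions, made-up numbers).
pi = np.array([[0.7, 0.3],
               [0.1, 0.9],
               [0.5, 0.5]])

def act(pi, state):
    """Sample an action from pi(. | state); depends only on the current state, not on the time."""
    return rng.choice(pi.shape[1], p=pi[state])

action = act(pi, state=1)   # returns 0 or 1, with probabilities 0.1 and 0.9
```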
##### Value function
The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$: $$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
The action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$: $$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$
- The state-value function can again be decomposed into immediate reward plus discounted value of the successor state: $$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$
- The action-value function can similarly be decomposed: $$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
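A minimal sketch of iterative policy evaluation, which repeatedly applies the decomposition of $v_\pi$ above as an update rule; the arrays `P` (with `P[s, a, s']` the transition probability), `R` (with `R[s, a]` the expected reward) and the policy matrix `pi` are assumed inputs, not objects defined in these notes:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate v(s) <- sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s,a,s') * v(s'))."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * (P @ v)                  # q[s, a]: immediate reward + discounted successor value
        v_new = np.einsum('sa,sa->s', pi, q)     # average over actions according to the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q                      # v_pi and the corresponding q_pi
        v = v_new
```

At convergence, `q` equals the action-value $q_\pi$, since averaging $q_\pi(S_{t+1}, A_{t+1})$ over $A_{t+1} \sim \pi$ gives $v_\pi(S_{t+1})$.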