MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states) and, through those, future rewards.
MDPs therefore involve delayed reward and the need to trade off immediate and delayed reward.
Whereas in bandit problems we estimated the value q*(a) of each action a, in MDPs we estimate the value q*(s, a) of each action a in each state s, or the value v*(s) of each state s given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, ...
- At each time step, the agent receives some representation of the environment's state and on that basis selects an action.
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state.
- Markov decision processes formally describe an environment for reinforcement learning
  - where the environment is fully observable,
  - i.e. the current state completely characterises the process.
- Almost all RL problems can be formalised as MDPs
  - e.g. bandits are MDPs with one state.
Markov property
“The future is independent of the past given the present”
A state St is Markov if and only if ℙ[ St+1 | St ] = ℙ[ St+1 | S1, ..., St ]
- A Markov process (or Markov chain) is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property (sketched in code below).
- A Markov process is a tuple ⟨S, P⟩:
  - S is a finite set of states
  - P is a state transition probability matrix, with
    Pss′ = ℙ[ St+1 = s′ | St = s ]
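A minimal sketch of such a Markov chain in code, represented by its transition matrix and sampled step by step; the 3-state chain and its probabilities are made up:

```python
import numpy as np

# Hypothetical 3-state Markov chain; each row of P sums to 1.
states = ["sunny", "rainy", "cloudy"]
P = np.array([
    [0.8, 0.1, 0.1],   # transition probabilities from "sunny"
    [0.3, 0.5, 0.2],   # from "rainy"
    [0.4, 0.3, 0.3],   # from "cloudy"
])

def sample_chain(P, start=0, n_steps=10, seed=0):
    """Sample S1, S2, ...; the next state depends only on the current one (Markov property)."""
    rng = np.random.default_rng(seed)
    s, traj = start, [start]
    for _ in range(n_steps):
        s = rng.choice(len(P), p=P[s])
        traj.append(s)
    return traj

print([states[i] for i in sample_chain(P)])
```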
Markov Reward Process
This is a Markov Process but we also have a reward function! We also have a discount factor.
Markov Reward Process is a tuple ⟨S, P, R, γ⟩
- S is a (finite) set of states
- P is a state transition probability matrix,
Pss′ = ℙ [ St+1=s′ | St=s ]
- R is a reward function,
Rs = 𝔼[ Rt+1 | St = s ]
- γ is a discount factor,
γ ∈ [0, 1]
- The discount γ ∈ [0, 1] is the present value of future rewards
  - the value of receiving reward R after k + 1 time-steps is γ^k R (see the sketch after this list)
  - this values immediate reward above delayed reward
  - γ close to 0 leads to "short-sighted" evaluation
  - γ close to 1 leads to "far-sighted" evaluation
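A tiny sketch of a γ-discounted sum of future rewards (this is the return Gt used later); the reward sequence and γ are made up:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, where rewards[k] holds R_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [-2, -2, -2, 10]                    # hypothetical reward sequence
print(discounted_return(rewards, gamma=0.9))  # -2 - 1.8 - 1.62 + 7.29 = 1.87
```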
Most Markov reward and decision processes are discounted. Why?
- mathematical convenience
- avoids infinite returns in cyclic Markov processes
- uncertainty about the future may not be fully represented
- if the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate rewards
- It is sometimes possible to use undiscounted Markov reward processes (γ = 1), e.g. if all sequences terminate
Value function
- The value function v(s) gives the long-term value of (being in) state s
- The state value function v(s) of an MRP is the expected return starting from state s
  v(s) = 𝔼[ Gt | St = s ]
  where the return is Gt = Rt+1 + γ Rt+2 + γ^2 Rt+3 + ... = Σk≥0 γ^k Rt+k+1
- The value function is a prediction of the reward obtainable in the next states
- The value function can be decomposed into two parts (since Gt = Rt+1 + γGt+1):
  - the immediate reward Rt+1
  - the discounted value of the successor state, γ v(St+1)
Bellman Equation for MRPs
v(s) = 𝔼[ Rt+1 + γ v(St+1) | St = s ]
- Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way
In the example, the value 4.3 is the reward I get exiting from the state (−2) plus the discount times the value of the next state, etc.
Solving the Bellman Equation
- The Bellman equation is a linear equation: in matrix form, v = R + γPv
  - it can be solved directly: v = (I − γP)⁻¹ R (see the sketch after this list)
  - complexity O(n^3) in the number of states
- there are many iterative methods for large MRPs:
  - dynamic programming
  - Monte-Carlo evaluation
  - temporal-difference learning
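A minimal sketch of the direct solution, solving v = (I − γP)⁻¹ R with numpy; the 3-state MRP below (P, R, γ) is made up:

```python
import numpy as np

# Hypothetical 3-state MRP (the numbers are made up).
P = np.array([
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.5],
])
R = np.array([-2.0, -2.0, 10.0])   # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9

# Solve (I - γP) v = R; np.linalg.solve is O(n^3) in the number of states.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```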
MDP
Before, transitions were purely random; now we have actions to choose from. But how do we choose?
We have policies: a policy is a distribution over actions given the state, 𝜋(a|s) = ℙ[ At = a | St = s ] (see the sketch after this list)
- a policy fully defines the behaviour of the agent
- MDP policies depend on the current state (not the history)
- policies are stationary (time-independent, depend only on the state but not on the time)
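A minimal sketch of a stationary stochastic policy stored as a table with π(a|s) in row s; the 2-state, 2-action probabilities are made up:

```python
import numpy as np

# pi[s, a] = π(a|s); each row sums to 1 (made-up numbers).
pi = np.array([
    [0.7, 0.3],   # distribution over actions in state 0
    [0.1, 0.9],   # distribution over actions in state 1
])

def select_action(pi, state, rng):
    # Depends only on the current state, not on the history or the time step.
    return rng.choice(pi.shape[1], p=pi[state])

rng = np.random.default_rng(0)
print(select_action(pi, state=0, rng=rng))
```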
Value function
The state-value function v𝜋(s) of an MDP is the expected return starting from state s and then following policy 𝜋:
v𝜋(s) = 𝔼𝜋[ Gt | St = s ]
The action-value function q𝜋(s, a) is the expected return starting from state s, taking action a, and then following policy 𝜋:
q𝜋(s, a) = 𝔼𝜋[ Gt | St = s, At = a ]
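As a sketch of "expected return under 𝜋", v𝜋(s) can be estimated by averaging sampled returns (Monte-Carlo evaluation, mentioned above). The tiny 2-state, 2-action MDP, the policy, γ, and the fixed horizon (used instead of terminal states) are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, horizon = 2, 2, 0.9, 60

# Made-up MDP: P[s, a, s'] = P(s' | s, a), R[s, a] = expected immediate reward.
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.6, 0.4]; P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])
pi = np.full((n_s, n_a), 0.5)      # uniform random policy

def sample_return(s):
    """One Monte-Carlo sample of G_t starting from state s and following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_a, p=pi[s])       # A_t ~ π(·|S_t)
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(n_s, p=P[s, a])     # S_{t+1} ~ P(·|S_t, A_t)
    return g

v_mc = [np.mean([sample_return(s) for _ in range(3000)]) for s in range(n_s)]
print(v_mc)   # Monte-Carlo estimate of v_pi(s) for each state
```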
- The state-value function can again be decomposed into immediate reward plus discounted value of successor state
v𝜋(s) = 𝔼𝜋[ Rt+1 + γ v𝜋(St+1) | St = s ]
- The action-value function can similarly be decomposed
q𝜋(s, a) = 𝔼𝜋[ Rt+1 + γ q𝜋(St+1, At+1) | St = s, At = a ]
Putting it all together (very important, remember it):
v𝜋(s) = Σa 𝜋(a|s) [ Rs^a + γ Σs′ Pss′^a v𝜋(s′) ]
q𝜋(s, a) = Rs^a + γ Σs′ Pss′^a Σa′ 𝜋(a′|s′) q𝜋(s′, a′)
As we can see, an action does not necessarily bring the agent to a specific state: the successor state is drawn from Pss′^a.
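A minimal sketch of using the combined equation as an iterative backup (policy evaluation); the MDP, policy and γ are the same made-up ones as in the Monte-Carlo sketch above:

```python
import numpy as np

# Made-up MDP: P[s, a, s'] = P(s' | s, a), R[s, a] = R_s^a, uniform random policy.
n_s, n_a, gamma = 2, 2, 0.9
P = np.zeros((n_s, n_a, n_s))
P[0, 0] = [1.0, 0.0]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.6, 0.4]; P[1, 1] = [0.0, 1.0]
R = np.array([[0.0, 1.0],
              [2.0, -1.0]])
pi = np.full((n_s, n_a), 0.5)

v = np.zeros(n_s)
for _ in range(500):
    q = R + gamma * (P @ v)        # q[s, a] = R_s^a + γ Σ_s' P_ss'^a v(s')
    v_new = (pi * q).sum(axis=1)   # v[s]   = Σ_a π(a|s) q(s, a)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(v)   # should roughly match the Monte-Carlo estimate above (up to sampling/truncation error)
```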
Example: Gridworld
Two actions lead back to the same state, two actions go to a different state.
The value of C is low because from C I get a reward of 0 everywhere I go.