9 Markov processes

MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations (states) and, through those, future rewards.

MDPs involve delayed rewards and the need to trade off immediate and delayed rewards.

Whereas in bandits we estimated the value q*(a) of each action a, in MDPs we estimate the value q*(s, a) of each action a in each state s, or we estimate the value v*(s) of each state s given optimal action selection.

  • MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.

  • The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .

  • At each timestep, the agent receives some representation of the environment state and on that basis selects an action

  • One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state

  • Markov decision processes formally describe an environment for reinforcement learning

  • Where the environment is fully observable

  • i.e. The current state completely characterises the process

  • Almost all RL problems can be formalised as MDPs

    • e.g. Bandits are MDPs with one state

Markov property: "The future is independent of the past given the present", i.e. a state S_t is Markov if and only if ℙ[S_{t+1} | S_t] = ℙ[S_{t+1} | S_1, ..., S_t]

  • A Markov process (or Markov chain) is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property. It is a tuple ⟨S, P⟩:
    • S is a finite set of states
    • P is a state transition probability matrix, P_{ss'} = ℙ[S_{t+1} = s' | S_t = s]
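
Example. A minimal sketch (a made-up two-state chain, not the one from the original figure) of a Markov process as a transition matrix, and of sampling a trajectory from it:

```python
import numpy as np

# Hypothetical two-state Markov chain: P[i, j] = Pr[S_{t+1} = j | S_t = i].
# Each row of P must sum to 1.
states = ["Sunny", "Rainy"]
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

rng = np.random.default_rng(0)

def sample_chain(start=0, steps=10):
    """Sample a trajectory: the next state depends only on the current state (Markov property)."""
    s, trajectory = start, [start]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])  # memoryless transition
        trajectory.append(s)
    return [states[i] for i in trajectory]

print(sample_chain())
```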

Markov Reward Process

This is a Markov process, but we also have a reward function and a discount factor.

Markov Reward Process is a tuple ⟨S, P, R, γ⟩

  • S is a (finite) set of states
  • P is a state transition probability matrix, P_{ss'} = ℙ[S_{t+1} = s' | S_t = s]
  • R is a reward function, R_s = 𝔼[R_{t+1} | S_t = s]
  • γ is a discount factor, γ ∈ [0, 1]
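
A minimal sketch of how such a tuple could be written down in code; the three-state chain and its rewards are made up for illustration:

```python
# Hypothetical Markov Reward Process <S, P, R, gamma> as plain Python data.
S = ["A", "B", "End"]                   # finite set of states
P = {                                   # P[s][s'] = Pr[S_{t+1} = s' | S_t = s]
    "A":   {"A": 0.5, "B": 0.5},
    "B":   {"A": 0.2, "End": 0.8},
    "End": {"End": 1.0},                # absorbing terminal state
}
R = {"A": -1.0, "B": -2.0, "End": 0.0}  # R_s = E[R_{t+1} | S_t = s]
gamma = 0.9                             # discount factor in [0, 1]
```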

The return G_t is the total discounted reward from time-step t:

G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

  • The discount γ ∈ [0, 1] is the present value of future rewards
  • The value of receiving reward R after k + 1 time-steps is γ^kR
  • This values immediate reward above delayed reward
    • γ close to 0 leads to "short-sighted" evaluation
    • γ close to 1 leads to "far-sighted" evaluation
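
A quick sketch of the effect of γ on the return of a made-up reward sequence:

```python
# Discounted return G_t = sum_k gamma^k * R_{t+k+1} for a made-up reward sequence.
rewards = [1.0, 0.0, 5.0, 10.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.0))  # 1.0   -> only the immediate reward counts ("short-sighted")
print(discounted_return(rewards, gamma=0.9))  # ~12.3 -> 1 + 0 + 0.81*5 + 0.729*10      ("far-sighted")
print(discounted_return(rewards, gamma=1.0))  # 16.0  -> undiscounted sum of all rewards
```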

Most Markov reward and decision processes are discounted. Why?

  • mathematical convenience
  • avoids infinite returns in cyclic Markov processes
  • uncertainty about the future may not be fully represented
  • if the reward is financial, immediate rewards may earn more interest than delayed rewards
  • Animal/human behaviour shows preference for immediate rewards
  • It is sometimes possible to use undiscounted Markov reward processes (γ = 1), e.g. if all sequences terminate

Value function

  • The value function v(s) gives the long-term value of (being in) state s
  • The state value function v(s) of an MRP is the expected return starting from state s: v(s) = 𝔼[G_t | S_t = s]

The value function is a prediction of the future reward obtainable from the next states.

  • The value function can be decomposed into two parts:
    • immediate reward R_{t+1}
    • discounted value of the successor state, γ v(S_{t+1})
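
This follows from the recursive form of the return, G_t = R_{t+1} + γ(R_{t+2} + γR_{t+3} + ...) = R_{t+1} + γG_{t+1}, so that

v(s) = 𝔼[G_t | S_t = s] = 𝔼[R_{t+1} + γG_{t+1} | S_t = s] = 𝔼[R_{t+1} + γ v(S_{t+1}) | S_t = s]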

Bellman Equation for MRPs

v(s) = 𝔼[R_{t+1} + γ v(S_{t+1}) | S_t = s]

  • Bellman equation averages over all the possibilities, weighting each by its probability of occurring
  • The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way

In the example, 4.3 is the reward I get exiting from the state (-2) plus the discount times the values of the next states, etc.

Solving the Bellman Equation

  • The Bellman equation is a linear equation
  • can be solved directly! (see the sketch after this list)
  • complexity O(n^3)
  • many iterative methods
    • dynamic programming
    • monte-carlo evaluation
    • temporal-difference learning
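
In matrix form the Bellman equation is v = R + γPv, so v = (I − γP)⁻¹R. A minimal sketch (made-up three-state MRP) of the direct solution and of a simple iterative alternative:

```python
import numpy as np

# Made-up three-state MRP: transition matrix P, reward vector R, discount gamma.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.0, 0.8],
              [0.0, 0.0, 1.0]])   # last state is absorbing
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Direct solution of v = R + gamma * P v  =>  (I - gamma*P) v = R, cost O(n^3).
v_direct = np.linalg.solve(np.eye(3) - gamma * P, R)

# Iterative alternative (dynamic-programming style): repeatedly apply the Bellman operator.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = R + gamma * P @ v_iter

print(v_direct)  # both converge to the same value function
print(v_iter)
```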

MDP

A Markov Decision Process is a Markov reward process with decisions. It is a tuple ⟨S, A, P, R, γ⟩:

  • S is a finite set of states
  • A is a finite set of actions
  • P is a state transition probability matrix, P^a_{ss'} = ℙ[S_{t+1} = s' | S_t = s, A_t = a]
  • R is a reward function, R^a_s = 𝔼[R_{t+1} | S_t = s, A_t = a]
  • γ is a discount factor, γ ∈ [0, 1]

Before we had only transition probabilities; now we also have actions to choose from. But how do we choose? We use policies: a policy π is a distribution over actions given the state, π(a|s) = ℙ[A_t = a | S_t = s]

  • policy fully defines the behavior of the agent
  • MDP policies depend on the current state (not the history)
  • policies are stationary (time-independent, depend only on the state but not on the time)
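
A minimal sketch (made-up states and actions) of a stochastic policy stored as a table of action probabilities, sampled using only the current state:

```python
import random

# Hypothetical policy pi(a|s): for each state, a distribution over actions.
policy = {
    "s0": {"left": 0.5, "right": 0.5},  # uniform in s0
    "s1": {"left": 0.1, "right": 0.9},  # prefers "right" in s1
}

def select_action(state):
    """Sample a ~ pi(.|state); depends only on the current state, not on the history."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(select_action("s0"))
```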

Value function

The state-value function v_π(s) of an MDP is the expected return starting from state s and then following policy π: v_π(s) = 𝔼_π[G_t | S_t = s]

The action-value function q_π(s, a) is the expected return starting from state s, taking action a, and then following policy π: q_π(s, a) = 𝔼_π[G_t | S_t = s, A_t = a]

  • The state-value function can again be decomposed into immediate reward plus discounted value of the successor state: v_π(s) = 𝔼_π[R_{t+1} + γ v_π(S_{t+1}) | S_t = s]
  • The action-value function can similarly be decomposed: q_π(s, a) = 𝔼_π[R_{t+1} + γ q_π(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]

Putting it all together (very important, remember it):
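
These are the standard Bellman expectation equations, written with the P^a_{ss'} and R^a_s notation above:

  • v_π(s) = Σ_a π(a|s) q_π(s, a)
  • q_π(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s')

and, substituting one into the other:

v_π(s) = Σ_a π(a|s) [ R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s') ]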

Note that an action does not necessarily lead to a specific state: the next state is drawn according to the transition probabilities P^a_{ss'}.

Example: Gridworld. Two actions lead back to the same state, and two actions go to a different state. The value of state C is low because from C I get a reward of 0 everywhere I go.
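
Not the exact gridworld from the figure, but a minimal sketch of how such state values can be computed: iterative policy evaluation of the uniform random policy on a made-up 2x2 grid with one terminal state and reward -1 per move.

```python
import numpy as np

# Made-up 2x2 gridworld (states 0 1 / 2 3): state 3 is terminal,
# every move gives reward -1, moves off the grid leave the state unchanged.
n_states, gamma = 4, 1.0
actions = {"up": -2, "down": 2, "left": -1, "right": 1}

def step(s, a):
    if s == 3:                       # terminal state: no reward, no movement
        return s, 0.0
    ns = s + actions[a]
    if ns < 0 or ns > 3:             # off the grid vertically
        ns = s
    if a == "left" and s % 2 == 0:   # off the grid on the left edge
        ns = s
    if a == "right" and s % 2 == 1:  # off the grid on the right edge
        ns = s
    return ns, -1.0

# Iterative policy evaluation of the uniform random policy pi(a|s) = 1/4.
v = np.zeros(n_states)
for _ in range(1000):
    v_new = np.zeros(n_states)
    for s in range(n_states):
        for a in actions:
            ns, r = step(s, a)
            v_new[s] += 0.25 * (r + gamma * v[ns])  # Bellman expectation backup
    v = v_new

print(v.round(2))  # states closer to the terminal state get higher (less negative) values
```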