Let's define the Bellman Optimality Equation:
v^*(s) = \max_a \sum_{s', r} p(s', r|s, a)[r + \gamma v^*(s')]
where v^*(s) represents the optimal value function for state s.
The equation has no closed-form solution, but there are several ways to compute it:
- dynamic programming (see the sketch after this list)
- ...
- Q-Learning
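As an illustration of the dynamic-programming route, here is a minimal value-iteration sketch in Python. The tiny two-state MDP and the `P[s][a] = [(probability, next_state, reward), ...]` structure are assumptions made purely for this example; each sweep applies the Bellman optimality backup to every state until the values stop changing.

```python
import numpy as np

# Hypothetical two-state MDP: P[s][a] lists (probability, next_state, reward) triples.
# This layout is assumed only for the sketch, not a standard API.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
v = np.zeros(len(P))  # value estimate per state

for _ in range(1000):
    v_new = np.zeros_like(v)
    for s, actions in P.items():
        # Bellman optimality backup: best action's expected one-step return plus discounted value.
        v_new[s] = max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
    if np.max(np.abs(v_new - v)) < 1e-8:  # stop once the values have converged
        break
    v = v_new

print(v)  # approximate v*(s) for each state
```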
Q-Learning is an iterative method that learns the optimal values "online", from experience.
Q-Learning combines Temporal Difference (TD) Learning with off-policy learning.
Temporal Difference Learning
At each step the state value is updated:
V(S_{t})=V(S_{t})+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_{t})]
This is a special case of the TD(\lambda) family called TD(0), or one-step TD. It works by updating the previous estimate every time an action has been taken.
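Below is a minimal sketch of tabular TD(0) policy evaluation. The environment interface (`reset()`, `step(action)` returning `(next_state, reward, done)`) and the `policy` function are hypothetical, assumed only for illustration.

```python
def td0_evaluate(env, policy, num_episodes=500, alpha=0.1, gamma=0.9):
    """Estimate V under `policy` with one-step TD (TD(0)) updates."""
    V = {}  # state -> value estimate; missing states default to 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD target: R_{t+1} + gamma * V(S_{t+1}), bootstrapping 0 at terminal states.
            target = reward + gamma * V.get(next_state, 0.0) * (not done)
            # Move V(S_t) a fraction alpha toward the target.
            V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
            state = next_state
    return V
```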
Q-Learning
Q(S, a)=Q(S, a)+\alpha[R+\gamma \max_{a'} Q(S', a')-Q(S, a)]
Under standard conditions (every state-action pair is visited infinitely often and the step size decays appropriately), this update converges to the optimal action-value function.
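Here is a minimal tabular Q-learning sketch under the same assumed environment interface as above; the epsilon-greedy behaviour policy and the `env.actions` attribute are illustrative assumptions, not part of any specific library.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)  # (state, action) -> estimate, default 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy; the learning target below stays greedy,
            # which is what makes Q-learning off-policy.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bootstrap from the best action in the next state (0 at terminal states).
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```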