vault backup: 2024-11-14 01:31:24

parent bc43175468 · commit b21b11655a
12 changed files with 145 additions and 29 deletions

Autonomous Networking/notes/10 Q-Learning.md (new file, 20 lines)
Let's define the Bellman Optimality Equation: $v^*(s) = \max_a \sum_{s', r} p(s',r|s,a)\,[r + \gamma v^*(s')]$, where $v^*(s)$ represents the optimal value function for state $s$.

The equation cannot be solved in closed form, but there are several ways to compute it:
- dynamic programming
- ...
- Q-Learning
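As a rough illustration of the dynamic-programming route, here is a minimal value-iteration sketch in Python that repeatedly applies the Bellman optimality backup; the nested-dict layout of the transition model `P[s][a]` is an assumption made for this example, not something fixed by the notes.

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Minimal value-iteration sketch (assumed model layout):
    P[s][a] is a list of (prob, next_state, reward) tuples."""
    V = {s: 0.0 for s in P}                 # initial value estimates
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of the expected return
            best = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                   # values have (numerically) converged
            return V
```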
Q-Learning is an iterative method that learns the optimal values "online".

Q-Learning combines TD-Learning with off-policy learning.

#### Temporal Difference Learning
At each step the state value is updated:
$$V(S_{t})=V(S_{t})+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_{t})]$$

This is a special case of $TD(\lambda)$ called $TD(0)$, or one-step TD. It works by updating the previous estimate every time an action is taken.
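A minimal sketch of what this TD(0) update could look like in code; the `env.reset()`/`env.step()` interface and the `policy` callable are assumptions for illustration, not part of the notes.

```python
def td0_evaluate(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation (sketch).
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    V = {}                                   # state-value estimates, default 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                    # follow the policy being evaluated
            s_next, r, done = env.step(a)
            # one-step TD update: move V(s) toward the target R + gamma * V(s')
            v_s = V.get(s, 0.0)
            target = r + gamma * (0.0 if done else V.get(s_next, 0.0))
            V[s] = v_s + alpha * (target - v_s)
            s = s_next
    return V
```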
#### Q-Learning
$$Q(S, a)=Q(S, a)+\alpha[R+\gamma \max_{a'}Q(S', a')-Q(S, a)]$$
This update converges to the optimal action-value function $Q^*$, provided every state-action pair is visited infinitely often and the learning rate is decayed appropriately.
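Putting the pieces together, a minimal tabular Q-Learning loop might look like the sketch below; the environment interface and the ε-greedy behavior policy are assumptions for illustration. The $\max_{a'}$ in the target is what makes the update off-policy: it bootstraps on the greedy action in $S'$, regardless of which action the behavior policy actually takes next.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning (sketch). Assumes env.reset() -> state and
    env.step(a) -> (next_state, reward, done); `actions` lists the legal actions."""
    Q = defaultdict(float)                   # Q[(state, action)], default 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy: explore with probability epsilon
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # off-policy target: bootstrap on the best action in s', not the one taken next
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```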