vault backup: 2024-11-14 01:31:24

parent bc43175468 · commit b21b11655a
12 changed files with 145 additions and 29 deletions

Autonomous Networking/notes/10 Q-Learning.md (new file, 20 lines)
Let's define the Bellman Optimality Equation: $v^*(s) = \max_a \sum_{s', r} p(s',r|s,a)\,[r + \gamma v^*(s')]$, where $v^*(s)$ represents the optimal value function for state $s$.

The equation cannot be solved in closed form, but there are several ways to compute it:
- dynamic programming
- ...
- Q-Learning
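As a rough illustration of the dynamic-programming route, here is a minimal value-iteration sketch in Python that repeatedly applies the Bellman optimality backup; the nested-dict layout of the transition model `P[s][a]` is an assumption made for this example, not something fixed by the notes.

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Minimal value-iteration sketch (assumed model layout):
    P[s][a] is a list of (prob, next_state, reward) tuples."""
    V = {s: 0.0 for s in P}                 # initial value estimates
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: max over actions of the expected return
            best = max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                   # values have (numerically) converged
            return V
```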
Q-Learning is an iterative method that learns the optimal values "online".

Q-Learning combines TD-Learning with off-policy learning.

#### Temporal Difference Learning
At each step the state value is updated:
$$V(S_{t})=V(S_{t})+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_{t})]$$

This is a special case of $TD(\lambda)$ called $TD(0)$, or one-step TD. It works by updating the previous estimate every time an action is taken.
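A minimal sketch of what this TD(0) update could look like in code; the `env.reset()`/`env.step()` interface and the `policy` callable are assumptions for illustration, not part of the notes.

```python
def td0_evaluate(env, policy, episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation (sketch).
    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done)."""
    V = {}                                   # state-value estimates, default 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                    # follow the policy being evaluated
            s_next, r, done = env.step(a)
            # one-step TD update: move V(s) toward the target R + gamma * V(s')
            v_s = V.get(s, 0.0)
            target = r + gamma * (0.0 if done else V.get(s_next, 0.0))
            V[s] = v_s + alpha * (target - v_s)
            s = s_next
    return V
```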
#### Q-Learning
$$Q(S, a)=Q(S, a)+\alpha[R+\gamma \max_{a'}Q(S', a')-Q(S, a)]$$
This update converges to the optimal action-value function $Q^*$, provided every state-action pair is visited infinitely often and the learning rate is decayed appropriately.
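Putting the pieces together, a minimal tabular Q-Learning loop might look like the sketch below; the environment interface and the ε-greedy behavior policy are assumptions for illustration. The $\max_{a'}$ in the target is what makes the update off-policy: it bootstraps on the greedy action in $S'$, regardless of which action the behavior policy actually takes next.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-Learning (sketch). Assumes env.reset() -> state and
    env.step(a) -> (next_state, reward, done); `actions` lists the legal actions."""
    Q = defaultdict(float)                   # Q[(state, action)], default 0.0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy: explore with probability epsilon
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # off-policy target: bootstrap on the best action in s', not the one taken next
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```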