Let's define the Bellman Optimality Equation:
v^*(s) = \max_a \sum_{s', r} p(s', r|s, a)[r + \gamma v^*(s')]
where v^*(s) represents the optimal value function for state s.
The equation has no closed-form solution, but there are several ways to compute it:
- dynamic programming (see the sketch after this list)
- ...
- Q-Learning
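As an illustration of the dynamic-programming route, here is a minimal value-iteration sketch in Python. The tiny two-state MDP and the `P[s][a] = [(probability, next_state, reward), ...]` structure are assumptions made purely for this example; each sweep applies the Bellman optimality backup to every state until the values stop changing.

```python
import numpy as np

# Hypothetical two-state MDP: P[s][a] lists (probability, next_state, reward) triples.
# This layout is assumed only for the sketch, not a standard API.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9
v = np.zeros(len(P))  # value estimate per state

for _ in range(1000):
    v_new = np.zeros_like(v)
    for s, actions in P.items():
        # Bellman optimality backup: best action's expected one-step return plus discounted value.
        v_new[s] = max(
            sum(p * (r + gamma * v[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
    if np.max(np.abs(v_new - v)) < 1e-8:  # stop once the values have converged
        break
    v = v_new

print(v)  # approximate v*(s) for each state
```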
Q-Learning is an iterative method that learns the optimal values "online", from experience.
Q-Learning combines Temporal Difference (TD) Learning with off-policy learning.
Temporal Difference Learning
At each step the state value is updated:
V(S_{t})=V(S_{t})+\alpha[R_{t+1}+\gamma V(S_{t+1})-V(S_{t})]
This is a special case of the TD(\lambda) family called TD(0), or one-step TD. It works by updating the previous estimate every time an action has been taken.
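Below is a minimal sketch of tabular TD(0) policy evaluation. The environment interface (`reset()`, `step(action)` returning `(next_state, reward, done)`) and the `policy` function are hypothetical, assumed only for illustration.

```python
def td0_evaluate(env, policy, num_episodes=500, alpha=0.1, gamma=0.9):
    """Estimate V under `policy` with one-step TD (TD(0)) updates."""
    V = {}  # state -> value estimate; missing states default to 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # One-step TD target: R_{t+1} + gamma * V(S_{t+1}), bootstrapping 0 at terminal states.
            target = reward + gamma * V.get(next_state, 0.0) * (not done)
            # Move V(S_t) a fraction alpha toward the target.
            V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
            state = next_state
    return V
```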
Q-Learning
Q(S, a)=Q(S, a)+\alpha[R+\gamma \max_{a'} Q(S', a')-Q(S, a)]
Under standard conditions (every state-action pair is visited infinitely often and the step size decays appropriately), this update converges to the optimal action-value function.
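Here is a minimal tabular Q-learning sketch under the same assumed environment interface as above; the epsilon-greedy behaviour policy and the `env.actions` attribute are illustrative assumptions, not part of any specific library.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = defaultdict(float)  # (state, action) -> estimate, default 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behaviour policy; the learning target below stays greedy,
            # which is what makes Q-learning off-policy.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bootstrap from the best action in the next state (0 at terminal states).
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```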