### The 10-armed testbed
- we compare different strategies to assess their relative effectiveness
- 10 actions along the X axis
- the Y axis shows the distribution of rewards
- each reward is sampled from a normal distribution with mean q*(a) and variance 1
- each q*(a) is drawn from a normal distribution with mean 0 and variance 1

![[Pasted image 20241025084609.png]]
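A minimal sketch of how such a testbed can be generated (NumPy; the names `true_values` and `sample_reward` are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                  # number of actions (arms)
true_values = rng.normal(0.0, 1.0, k)   # q*(a) ~ N(0, 1), one per action

def sample_reward(action):
    # each reward is a noisy sample around the true action value: N(q*(a), 1)
    return rng.normal(true_values[action], 1.0)
```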
- q* is randomly sampled from a normal distribution
- rewards are randomly sampled based on q*
- actions are taken at random on exploration steps
- to fairly compare different methods we need to perform many independent runs
- for any learning method we measure its performance averaged over 2000 independent runs

![[Pasted image 20241025084755.png]]
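A sketch of this evaluation protocol (assuming an `agent_factory` that builds a fresh agent exposing `select_action`/`update`; these names, and the 1000 steps per run, are only illustrative):

```python
import numpy as np

def evaluate(agent_factory, runs=2000, steps=1000, k=10, seed=0):
    # average reward per time step, over many independent bandit problems
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        true_values = rng.normal(0.0, 1.0, k)    # fresh q*(a) for this run
        agent = agent_factory()
        for t in range(steps):
            a = agent.select_action()
            r = rng.normal(true_values[a], 1.0)  # reward ~ N(q*(a), 1)
            agent.update(a, r)
            avg_reward[t] += r
    return avg_reward / runs
```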
... add slides ...

![[Pasted image 20241025084830.png]]
#### Experiments
- run experiments for different values of epsilon:
    - 0
    - 0.01
    - 0.1

![[Pasted image 20241025084938.png]]
- exploring more (eps = 0.1), the agent finds the best actions sooner
- exploring less (eps = 0.01), it converges more slowly
- not exploring at all (eps = 0), it may never find the best action(s)
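A minimal eps-greedy agent with incremental sample-average estimates (a sketch; the class name and interface are illustrative and match the `evaluate` sketch above):

```python
import numpy as np

class EpsilonGreedyAgent:
    def __init__(self, k=10, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.Q = np.zeros(k)   # estimated action values
        self.N = np.zeros(k)   # how many times each action was taken

    def select_action(self):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.Q)))  # explore: random action
        return int(np.argmax(self.Q))                   # exploit: greedy action

    def update(self, action, reward):
        # incremental sample-average update: Q <- Q + (r - Q) / N
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```

For example, `evaluate(lambda: EpsilonGreedyAgent(epsilon=0.1))` gives one average-reward curve like those in the figure above.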
Let's do the same experiment starting with optimistic initial values
- we start with deliberately high initial estimates of the rewards
- we set Q1(a) = +5 for all actions

![[Pasted image 20241025085237.png]]

As we can see, the system explores more at the beginning, which is good: it finds the best actions to take sooner!
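A sketch of a purely greedy agent with optimistic initial values (assuming a constant step size of 0.1, as in the usual textbook version of this experiment; the class name is illustrative):

```python
import numpy as np

class OptimisticGreedyAgent:
    def __init__(self, k=10, initial_q=5.0, step_size=0.1):
        self.alpha = step_size
        self.Q = np.full(k, initial_q)   # optimistic initial estimates: Q1(a) = +5

    def select_action(self):
        # purely greedy: the optimism itself drives the early exploration
        return int(np.argmax(self.Q))

    def update(self, action, reward):
        # constant step-size update pulls the inflated estimate down toward reality
        self.Q[action] += self.alpha * (reward - self.Q[action])
```

Because every observed reward is far below +5, each action that gets tried has its estimate pulled down, so the greedy agent keeps switching to the still-inflated untried actions early on.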
**Optimistic initial value method:**

- explores more at the beginning
- only effective for stationary problems
- for non-stationary problems we have to use eps-greedy
### Optimism in the Face of Uncertainty
- ...
- easy problem:
    - two arms, one always good and one always bad
    - try both and you're done
- hard problem:
    - one arm is much better than the other, but there is a lot of noise
    - it takes a really long time to disambiguate them (see the quick check below)
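A quick check of the hard case, with made-up numbers (true values 1.0 vs 0.9, reward variance 1):

```python
import numpy as np

rng = np.random.default_rng(0)
gap, sigma, trials = 0.1, 1.0, 1000    # hypothetical: value gap 0.1, noise std 1

for pulls in (10, 100, 1000):
    # pull each arm `pulls` times and compare the sample means
    good = rng.normal(1.0, sigma, (trials, pulls)).mean(axis=1)
    bad = rng.normal(1.0 - gap, sigma, (trials, pulls)).mean(axis=1)
    # fraction of trials in which the worse arm still looks better
    print(pulls, (bad > good).mean())
```

Even after 100 pulls of each arm, the worse arm still looks better in a sizable fraction of trials; only around 1000 pulls per arm does the ordering become reliable.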
![[Pasted image 20241025085759.png]]
Which action should we pick?

- greedy would pick the green one
- eps-greedy would too (most of the time)
- optimism in the face of uncertainty says:
    - the more uncertain we are about an action value, the more important it is to explore that action, as it could turn out to be the best!
    - principle: *do not take the arm you believe is best, take the one which has the most potential to be the best*
![[Pasted image 20241025090344.png]]
The brackets represent a confidence interval around q*(a): the system is confident that the true value lies somewhere in that region.

If the region is very small, we are very certain!

![[Pasted image 20241025090549.png]]

In this situation we choose Q2, as its estimated value is the highest.
#### Action selection
![[Pasted image 20241025090625.png]]

... check slides for formula explanation ...
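For reference, the formula in the slide is presumably the standard UCB action-selection rule (as in Sutton & Barto):

$$A_t = \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where Q_t(a) is the current value estimate, N_t(a) is the number of times action a has been taken so far, t is the current time step, and c > 0 controls the width of the confidence term (the "brackets" in the figures above).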
- to systematically reduce uncertainty, UCB explores more at the beginning
- UCB's exploration reduces over time, while eps-greedy continues to take a random action 10% of the time (for eps = 0.1)
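A minimal UCB agent implementing the rule above (a sketch; c = 2 and the interface are illustrative, matching the earlier agents):

```python
import numpy as np

class UCBAgent:
    def __init__(self, k=10, c=2.0):
        self.c = c              # exploration strength
        self.Q = np.zeros(k)    # estimated action values
        self.N = np.zeros(k)    # action counts
        self.t = 0              # total time steps so far

    def select_action(self):
        self.t += 1
        # an untried action has an unbounded confidence interval: try it first
        if np.any(self.N == 0):
            return int(np.argmin(self.N))
        upper_bounds = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
        return int(np.argmax(upper_bounds))

    def update(self, action, reward):
        # incremental sample-average update
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```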