
The 10-armed testbed

  • we compare different strategies to assess their relative effectiveness

  • 10 actions along the X axis

  • Y axis shows the distribution of rewards

  • Each reward is sampled from a normal distribution with some mean q*(a) and variance=1

  • Each q*(a) is drawn from a normal distribution with mean=0 and variance=1 ![[Pasted image 20241025084609.png]]

  • q* is randomly sampled from a normal distribution

  • rewards are randomly sampled based on q

  • actions are randomly taken on exploration steps

  • to fairly compare different methods we need to perform many independent runs

  • for any learning method we measure its performance over 2000 independent runs (a minimal sketch of this testbed is given below)

![[Pasted image 20241025084755.png]]
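A minimal sketch of the testbed, assuming NumPy (the function names and the 1000-step run length are illustrative choices, not from the slides):

```python
import numpy as np

def make_bandit(k=10, rng=None):
    """Create one k-armed bandit: each q*(a) is drawn from N(0, 1)."""
    rng = rng or np.random.default_rng()
    return rng.normal(loc=0.0, scale=1.0, size=k)

def pull(q_star, action, rng):
    """The reward for an action is sampled from N(q*(a), 1)."""
    return rng.normal(loc=q_star[action], scale=1.0)

# Fair comparison: average each method's performance over 2000 independent runs,
# each on a freshly sampled bandit (e.g. 1000 steps per run).
rng = np.random.default_rng(0)
n_runs, n_steps, k = 2000, 1000, 10
for run in range(n_runs):
    q_star = make_bandit(k, rng)
    # ... run the learning method being evaluated for n_steps on this bandit ...
```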

... add slides ... ![[Pasted image 20241025084830.png]]

Experiments

  • run experiments for different epsilons:
    • eps = 0
    • eps = 0.01
    • eps = 0.1

![[Pasted image 20241025084938.png]]

  • exploring more (eps = 0.1) finds the best actions sooner
  • exploring less (eps = 0.01) makes convergence slower
  • not exploring at all (eps = 0) may never find the best action(s); see the eps-greedy sketch below
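
A minimal eps-greedy agent for this testbed, using incremental sample-average updates (a sketch under the same assumptions as above; the slides may formulate it differently):

```python
import numpy as np

def eps_greedy_run(q_star, eps, n_steps, rng):
    """Run eps-greedy on one bandit; returns the reward obtained at each step."""
    k = len(q_star)
    Q = np.zeros(k)                      # action-value estimates
    N = np.zeros(k)                      # how many times each action was taken
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        if rng.random() < eps:
            a = int(rng.integers(k))     # exploration step: random action
        else:
            a = int(np.argmax(Q))        # greedy step: best estimate so far
        r = rng.normal(q_star[a], 1.0)   # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental sample-average update
        rewards[t] = r
    return rewards
```

Running this with eps in {0, 0.01, 0.1} and averaging the reward curves over the 2000 independent runs should reproduce the comparison shown in the plot above.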

Let's do the same experiment starting with optimistic initial values

  • we start with high initial estimates of the rewards
  • we set q1(a) = +5 for all actions

![[Pasted image 20241025085237.png]]

As we can see, the system explores more at the beginning, which is good: it finds the best actions to take sooner (see the code sketch below).

Optimistic initial value method:

  • explores more at the beginning
  • only effective for stationary problems
    • for non-stationary problems we have to use eps-greedy
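
A minimal sketch of the optimistic-initial-values variant (the constant step size alpha = 0.1 is an assumption taken from the usual textbook treatment; the slides may use a different update):

```python
import numpy as np

def optimistic_greedy_run(q_star, n_steps, rng, q1=5.0, alpha=0.1):
    """Purely greedy agent whose estimates start at q1(a) = +5."""
    k = len(q_star)
    Q = np.full(k, q1)                   # optimistic initial estimates
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        a = int(np.argmax(Q))            # always greedy, no random exploration
        r = rng.normal(q_star[a], 1.0)   # reward ~ N(q*(a), 1)
        Q[a] += alpha * (r - Q[a])       # constant-step-size update
        rewards[t] = r
    return rewards
```

Because early rewards are almost always far below +5, every arm looks disappointing after being tried, so the greedy agent keeps switching arms: exploration happens without any randomness.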

Optimism in the Face of Uncertainty

  • ...
  • easy problem:
    • two arms, one always good and one always bad
    • try both once and you are done
  • hard problem:
    • one arm is much better than the other, but the rewards are very noisy
    • it takes a long time to disambiguate them

![[Pasted image 20241025085759.png]]

Which action should we pick?

  • greedy would pick the green one
  • eps-greedy would too, most of the time
  • optimism in the face of uncertainty says:
    • the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
    • principle: do not take the arm you believe is best, take the one which has the most potential to be the best

![[Pasted image 20241025090344.png]]

The brackets represent a confidence interval around q*(a): the system is confident that the value lies somewhere in that region.

If the region is very small, we are very certain!

![[Pasted image 20241025090549.png]]

In this situation we choose Q2, as its estimated value is the highest.

Action selection

![[Pasted image 20241025090625.png]]

... check slides for the formula explanation ...
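
For reference, the standard UCB action-selection rule (as in Sutton & Barto) is

$$A_t = \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where Q_t(a) is the current value estimate, N_t(a) is the number of times action a has been selected so far, and c > 0 controls the amount of exploration; the square-root term is the uncertainty bonus.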

  • to systematically reduce uncertainty, UCB explores more at the beginning
  • UCB's exploration reduces over time, while eps-greedy keeps taking a random action 10% of the time (with eps = 0.1); see the sketch below
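
A minimal UCB sketch for the same testbed (the exploration weight c = 2 is an assumed, commonly used value; the slides may use another):

```python
import numpy as np

def ucb_run(q_star, n_steps, rng, c=2.0):
    """UCB action selection on one bandit; returns the reward at each step."""
    k = len(q_star)
    Q = np.zeros(k)                      # value estimates
    N = np.zeros(k)                      # selection counts
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        if np.any(N == 0):
            a = int(np.argmin(N))        # try every arm at least once
        else:
            bonus = c * np.sqrt(np.log(t + 1) / N)   # uncertainty bonus
            a = int(np.argmax(Q + bonus))
        r = rng.normal(q_star[a], 1.0)   # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # sample-average update
        rewards[t] = r
    return rewards
```

As N(a) grows, the bonus term shrinks, which is exactly why UCB's exploration tapers off over time while eps-greedy keeps exploring at a constant rate.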