master-degree-notes/Autonomous Networking/notes/7.2 10 arm testbed - optimism in face of uncertainty.md

The 10-armed testbed

  • we compare different strategies to assess their relative effectiveness

  • 10 actions along the X axis

  • Y axis shows the distribution of rewards

  • Each reward is sampled from a normal distribution with some mean q*(a) and variance=1

  • Each q*(a) is drawn from a normal distribution with mean=0 and variance=1

  • q* is randomly sampled from a normal distribution

  • rewards are randomly sampled based on q

  • actions are randomly taken on exploration steps

  • to fairly compare different methods we need to perform many independent runs

  • for any learning method we measure its performance over 2000 independent runs (a minimal sketch of this setup is given below)
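A minimal sketch of how such a testbed can be generated, assuming NumPy; the helper names (`make_testbed`, `pull`) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only for reproducibility

def make_testbed(k=10):
    """Each true action value q*(a) is drawn from N(0, 1)."""
    return rng.normal(0.0, 1.0, size=k)

def pull(q_star, a):
    """A reward for action a is sampled from N(q*(a), 1)."""
    return rng.normal(q_star[a], 1.0)

q_star = make_testbed()
print(pull(q_star, 3))  # one sampled reward for arm 3
```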


Experiments

  • run experiments for different values of ε:
    • ε = 0
    • ε = 0.01
    • ε = 0.1


  • exploring more (ε = 0.1) finds the best actions sooner
  • exploring less (ε = 0.01) makes convergence slower
  • not exploring (ε = 0, greedy) may never find the best action(s) (a sketch of this comparison is given below)
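A sketch of the comparison under the assumptions above (10 arms, rewards as in the testbed, incremental sample-average estimates); `run_eps_greedy` is a hypothetical helper, not code from the lecture:

```python
import numpy as np

def run_eps_greedy(eps, runs=2000, steps=1000, k=10, seed=0):
    """Average reward per step of eps-greedy, averaged over independent runs."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, size=k)   # a fresh testbed for each run
        Q = np.zeros(k)                          # value estimates
        N = np.zeros(k)                          # action counts
        for t in range(steps):
            if rng.random() < eps:
                a = int(rng.integers(k))         # exploration step: random action
            else:
                a = int(np.argmax(Q))            # greedy step: current best estimate
            r = rng.normal(q_star[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update
            avg_reward[t] += r
    return avg_reward / runs

for eps in (0.0, 0.01, 0.1):
    curve = run_eps_greedy(eps, runs=200)        # fewer runs here, just to keep it fast
    print(f"eps={eps}: mean reward over the last 100 steps = {curve[-100:].mean():.3f}")
```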

Let's do the same experiment starting with optimistic initial values

  • we start with a high initial value for the action-value estimates
  • we set Q1(a) = +5 for all actions
  • as we can see, the system explores more at the beginning, which is good as it finds the best actions to take sooner (a sketch is given below)
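A sketch of the optimistic-initial-values experiment; the constant step size α = 0.1 and purely greedy selection are assumptions taken from the standard textbook version of this experiment, not stated in the lecture:

```python
import numpy as np

def run_optimistic(q1=5.0, alpha=0.1, runs=2000, steps=1000, k=10, seed=0):
    """Greedy agent with optimistic initial estimates Q1(a) = q1 for all actions."""
    rng = np.random.default_rng(seed)
    optimal_pct = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, size=k)
        best = int(np.argmax(q_star))
        Q = np.full(k, q1)                       # optimistic initial values
        for t in range(steps):
            a = int(np.argmax(Q))                # purely greedy selection
            r = rng.normal(q_star[a], 1.0)
            Q[a] += alpha * (r - Q[a])           # constant step-size update (assumed)
            optimal_pct[t] += (a == best)
    return 100.0 * optimal_pct / runs

curve = run_optimistic(runs=200)
print(curve[::100])  # % optimal action at a few checkpoints: low at first, then high
```

The high initial estimates get "disappointed" by the actual rewards, so the greedy choice keeps switching arms early on, which is exactly the extra exploration described above.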

Optimistic initial value method:

  • explores more at the beginning
  • only effective for stationary problems
    • for non-stationary problems we have to use eps-greedy

Optimism in the Face of Uncertainty - Upper Confidence Bound (UCB)

  • ...
  • easy problem:
    • two arms, one always good and one always bad
    • try both and done
  • hard problem:
    • one arm is much better than the other, but there is a lot of noise
    • it takes a really long time to disambiguate

Which action should we pick?

  • greedy would pick the green one
  • eps-greedy too
  • optimism in the face of uncertainty says:
    • the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
    • principle: do not take the arm you believe is best, take the one which has the most potential to be the best

The brackets represent a confidence interval around q*(a). The system is confident that the value lies somewhere in that region. The problem is that, when a region is large, we don't know where the average value is, so we have to try!

If the region is very small, we are very certain!

In the first situation we choose Q2, as its estimated value is the highest. In the second, we choose Q1 instead, since it has the most potential to be the best.

Action selection

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

  • We will select the action that has the highest estimated value plus the upper-confidence bound exploration term

  • c is a user-specified parameter that controls the amount of exploration

  • N_t(a) is the number of times action a has been taken so far

  • to systematically reduce uncertainty, UCB explores more at the beginning

  • UCB's exploration reduces over time, while eps-greedy continues to take a random action 10% of the time (with ε = 0.1); a sketch of the UCB rule is given below
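A sketch of the UCB selection rule run on the same testbed; `ucb_action` and the choice c = 2 are illustrative assumptions:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried actions are taken first."""
    if np.any(N == 0):
        return int(np.argmin(N))                 # every action is tried at least once
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

def run_ucb(c=2.0, runs=2000, steps=1000, k=10, seed=0):
    """Average reward per step of UCB with sample-average estimates."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, size=k)
        Q, N = np.zeros(k), np.zeros(k)
        for t in range(steps):
            a = ucb_action(Q, N, t + 1, c)       # exploration bonus shrinks as N_t(a) grows
            r = rng.normal(q_star[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]
            avg_reward[t] += r
    return avg_reward / runs

print(run_ucb(runs=200)[-100:].mean())           # mean reward over the last 100 steps
```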

AI generated summary

In the end, we can say that UCB is an effective strategy for balancing exploration and exploitation in multi-armed bandit problems. Unlike ε-greedy methods, which maintain a fixed level of exploration throughout the process, UCB dynamically adjusts its exploration rate based on the uncertainty associated with each action's value estimates. This adaptability makes UCB particularly well-suited for scenarios where initial exploration is crucial to quickly identify high-reward actions but later iterations require more focused exploitation.