master-degree-notes/Autonomous Networking/notes/7.2 10 arm testbed - optimism in face of uncertainty.md

The 10-armed testbed

  • we compare different strategies to assess their relative effectiveness

  • 10 actions along the X axis

  • Y axis shows the distribution of rewards

  • Each reward is sampled from a normal distribution with some mean q*(a) and variance=1

  • Each q*(a) is drawn from a normal distribution with mean=0 and variance=1

  • q* is randomly sampled from a normal distribution

  • rewards are randomly sampled based on q

  • actions are randomly taken on exploration steps

  • to fairly compare different methods we need to perform many independent runs

  • for any learning method we measure its performance over 2000 independent runs (a minimal sketch of this setup is given below)
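A minimal sketch of how such a testbed can be generated, assuming NumPy; the helper names (`make_testbed`, `pull`) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only for reproducibility

def make_testbed(k=10):
    """Each true action value q*(a) is drawn from N(0, 1)."""
    return rng.normal(0.0, 1.0, size=k)

def pull(q_star, a):
    """A reward for action a is sampled from N(q*(a), 1)."""
    return rng.normal(q_star[a], 1.0)

q_star = make_testbed()
print(pull(q_star, 3))  # one sampled reward for arm 3
```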


Experiments

  • run experiments for different values of ε:
    • ε = 0
    • ε = 0.01
    • ε = 0.1


  • exploring more (ε = 0.1) finds the best actions sooner
  • exploring less (ε = 0.01) makes convergence slower
  • not exploring (ε = 0, greedy) may never find the best action(s) (a sketch of this comparison is given below)
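A sketch of the comparison under the assumptions above (10 arms, rewards as in the testbed, incremental sample-average estimates); `run_eps_greedy` is a hypothetical helper, not code from the lecture:

```python
import numpy as np

def run_eps_greedy(eps, runs=2000, steps=1000, k=10, seed=0):
    """Average reward per step of eps-greedy, averaged over independent runs."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, size=k)   # a fresh testbed for each run
        Q = np.zeros(k)                          # value estimates
        N = np.zeros(k)                          # action counts
        for t in range(steps):
            if rng.random() < eps:
                a = int(rng.integers(k))         # exploration step: random action
            else:
                a = int(np.argmax(Q))            # greedy step: current best estimate
            r = rng.normal(q_star[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update
            avg_reward[t] += r
    return avg_reward / runs

for eps in (0.0, 0.01, 0.1):
    curve = run_eps_greedy(eps, runs=200)        # fewer runs here, just to keep it fast
    print(f"eps={eps}: mean reward over the last 100 steps = {curve[-100:].mean():.3f}")
```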

Let's do the same experiment starting with optimistic initial values

  • we start with a high initial value for the action-value estimates
  • we set Q1(a) = +5 for all actions
  • as we can see, the system explores more at the beginning, which is good as it finds the best actions to take sooner (a sketch is given below)
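A sketch of the optimistic-initial-values experiment; the constant step size α = 0.1 and purely greedy selection are assumptions taken from the standard textbook version of this experiment, not stated in the lecture:

```python
import numpy as np

def run_optimistic(q1=5.0, alpha=0.1, runs=2000, steps=1000, k=10, seed=0):
    """Greedy agent with optimistic initial estimates Q1(a) = q1 for all actions."""
    rng = np.random.default_rng(seed)
    optimal_pct = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, size=k)
        best = int(np.argmax(q_star))
        Q = np.full(k, q1)                       # optimistic initial values
        for t in range(steps):
            a = int(np.argmax(Q))                # purely greedy selection
            r = rng.normal(q_star[a], 1.0)
            Q[a] += alpha * (r - Q[a])           # constant step-size update (assumed)
            optimal_pct[t] += (a == best)
    return 100.0 * optimal_pct / runs

curve = run_optimistic(runs=200)
print(curve[::100])  # % optimal action at a few checkpoints: low at first, then high
```

The high initial estimates get "disappointed" by the actual rewards, so the greedy choice keeps switching arms early on, which is exactly the extra exploration described above.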

Optimistic initial value method:

  • explores more at the beginning
  • only effective for stationary problems
    • for non-stationary problems we have to use eps-greedy

Optimism in the Face of Uncertainty - Upper Confidence Bound (UCB)

  • ...
  • easy problem:
    • two arms, one always good and one always bad
    • try both and done
  • hard problem:
    • one arm is much better than the other, but there is a lot of noise
    • it takes a really long time to disambiguate

Which action should we pick?

  • greedy would pick the green one
  • eps-greedy too
  • optimism in the face of uncertainty says:
    • the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
    • principle: do not take the arm you believe is best, take the one which has the most potential to be the best

The brackets represent a confidence interval around q*(a). The system is confident that the value lies somewhere in that region. The problem is that, when a region is large, we don't know where the average value is, so we have to try!

If the region is very small, we are very certain!

In the first situation we choose Q2, as its estimated value is the highest. In the second, we choose Q1 instead, since it has the most potential to be the best.

Action selection

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

  • We will select the action that has the highest estimated value plus the upper-confidence bound exploration term

  • c is a user-specified parameter that controls the amount of exploration

  • N_t(a) is the number of times action a has been taken so far

  • to systematically reduce uncertainty, UCB explores more at the beginning

  • UCB's exploration reduces over time, while eps-greedy continues to take a random action 10% of the time (with ε = 0.1); a sketch of the UCB rule is given below
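A sketch of the UCB selection rule run on the same testbed; `ucb_action` and the choice c = 2 are illustrative assumptions:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ]; untried actions are taken first."""
    if np.any(N == 0):
        return int(np.argmin(N))                 # every action is tried at least once
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

def run_ucb(c=2.0, runs=2000, steps=1000, k=10, seed=0):
    """Average reward per step of UCB with sample-average estimates."""
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        q_star = rng.normal(0.0, 1.0, size=k)
        Q, N = np.zeros(k), np.zeros(k)
        for t in range(steps):
            a = ucb_action(Q, N, t + 1, c)       # exploration bonus shrinks as N_t(a) grows
            r = rng.normal(q_star[a], 1.0)
            N[a] += 1
            Q[a] += (r - Q[a]) / N[a]
            avg_reward[t] += r
    return avg_reward / runs

print(run_ucb(runs=200)[-100:].mean())           # mean reward over the last 100 steps
```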

AI generated summary

In the end, we can say that UCB is an effective strategy for balancing exploration and exploitation in multi-armed bandit problems. Unlike ε-greedy methods, which maintain a fixed level of exploration throughout the process, UCB dynamically adjusts its exploration rate based on the uncertainty associated with each action's value estimates. This adaptability makes UCB particularly well-suited for scenarios where initial exploration is crucial to quickly identify high-reward actions but later iterations require more focused exploitation.