master-degree-notes/Autonomous Networking/notes/7.2 10 arm testbed - optimism in face of
2024-11-02 16:28:37 +01:00

The 10-arms testbed

  • we compare different strategies to assess their relative effectiveness

  • 10 actions along the X axis

  • Y axis shows the distribution of rewards

  • Each reward is sampled from a normal distribution with some mean q*(a) and variance=1

  • Each q*(a) is drawn from a normal distribution with mean=0 and variance=1

!Pasted image 20241025084609.png

  • q* is randomly sampled from a normal distribution

  • rewards are randomly sampled around q*

  • actions are randomly taken on exploration steps

  • to fairly compare different methods we need to perform many independent runs

  • for any learning method we measure its performance over 2000 independent runs
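
A minimal sketch of how one testbed instance could be generated, following the description above. The names make_testbed and pull are hypothetical helpers, not something given in the lecture, and NumPy is assumed as the sampling library.

```python
import numpy as np

def make_testbed(n_arms=10, rng=None):
    """Draw the true action values q*(a) ~ N(0, 1), one per arm."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=0.0, scale=1.0, size=n_arms)

def pull(q_star, action, rng):
    """Sample one reward for `action` from N(q*(action), 1)."""
    return rng.normal(loc=q_star[action], scale=1.0)

# One independent run starts by drawing a fresh set of true values.
rng = np.random.default_rng(0)
q_star = make_testbed(rng=rng)

# Rewards for the same arm differ from pull to pull because of the unit-variance noise.
rewards = [pull(q_star, 3, rng) for _ in range(5)]
```

Repeating this with a new q_star for each of the 2000 runs and averaging the results is what makes the comparison between methods fair.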

!Pasted image 20241025084755.png

!Pasted image 20241025084830.png


  • run experiments for different epsilons
    • 0
    • 0.01
    • 0.1

!Pasted image 20241025084938.png

  • exploring more, the agent finds the best actions sooner
  • exploring less, it converges more slowly
  • not exploring at all, it may never find the best action(s)
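
The experiment can be sketched as below. This is only an illustration under stated assumptions: sample-average value estimates, initial estimates of 0, ties in the greedy choice broken toward the lowest arm index, and all runs simulated in parallel as rows of an array. The function name run_eps_greedy is hypothetical.

```python
import numpy as np

def run_eps_greedy(epsilon, n_runs=2000, n_steps=1000, n_arms=10, seed=0):
    """Return the reward at each step, averaged over n_runs independent testbeds."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=(n_runs, n_arms))  # true values, one row per run
    Q = np.zeros((n_runs, n_arms))                        # value estimates
    N = np.zeros((n_runs, n_arms))                        # pull counts
    rows = np.arange(n_runs)
    avg_reward = np.zeros(n_steps)
    for t in range(n_steps):
        greedy = Q.argmax(axis=1)                         # ties go to the lowest index
        random_arm = rng.integers(n_arms, size=n_runs)
        explore = rng.random(n_runs) < epsilon            # exploration step with prob. epsilon
        action = np.where(explore, random_arm, greedy)
        reward = rng.normal(q_star[rows, action], 1.0)
        N[rows, action] += 1
        Q[rows, action] += (reward - Q[rows, action]) / N[rows, action]  # sample average
        avg_reward[t] = reward.mean()
    return avg_reward

if __name__ == "__main__":
    for eps in (0.0, 0.01, 0.1):
        curve = run_eps_greedy(eps)
        print(f"eps = {eps}: average reward over the last 100 steps = {curve[-100:].mean():.2f}")
```

Plotting the returned curves for the three epsilons should give a picture comparable to the figure above, although the exact numbers depend on the seed and on the assumptions listed.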

Let's do the same experiment starting with optimistic initial values

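  • we start with high (optimistic) initial estimates of the action values
  • we set Q1(a) = +5 for all actions

!Pasted image 20241025085237.png

As we can see, the system explores more at the beginning: rewards almost never reach +5, so every action that is tried looks disappointing and the agent moves on to the ones it has not tried yet. This is good, as it finds the best actions to take sooner!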

Optimistic initial value method:

  • explores more at the beginning
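  • only effective for stationary problems, because the push to explore comes only from the initial estimates and fades once they have been corrected
    • for non-stationary problems we need a method that keeps exploring, such as eps-greedy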
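
A small sketch of the effect, comparing a purely greedy agent started at Q1 = 0 with one started at Q1 = +5. The constant step size alpha = 0.1 is an assumption (the notes do not say which update rule was used), and greedy_run is a hypothetical helper.

```python
import numpy as np

def greedy_run(q_init, n_steps=1000, n_arms=10, alpha=0.1, seed=0):
    """Purely greedy agent whose estimates all start at q_init."""
    rng = np.random.default_rng(seed)
    q_star = rng.normal(0.0, 1.0, size=n_arms)   # true action values
    Q = np.full(n_arms, float(q_init))           # initial estimates
    actions = np.empty(n_steps, dtype=int)
    for t in range(n_steps):
        a = int(np.argmax(Q))                    # always exploit the current estimates
        r = rng.normal(q_star[a], 1.0)
        Q[a] += alpha * (r - Q[a])               # constant step size (assumed)
        actions[t] = a
    return actions, q_star

def mean_distinct_arms(q_init, n_steps=10, n_seeds=200):
    """Average number of different arms tried in the first n_steps."""
    counts = [len(set(greedy_run(q_init, n_steps=n_steps, seed=s)[0]))
              for s in range(n_seeds)]
    return float(np.mean(counts))

if __name__ == "__main__":
    for q_init in (0.0, 5.0):
        print(f"Q1 = {q_init}: {mean_distinct_arms(q_init):.1f} distinct arms in the first 10 steps")
```

With the optimistic start, each pull lowers the estimate of the arm just tried below the +5 of the untried arms, so the greedy choice sweeps through the arms even though no random exploration is used.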

Optimism in the Face of Uncertainty - Upper Confidence Bound (UCB)

  • ...
  • easy problem:
    • two arms, one always good and one always bad
    • try both and done
  • hard problem:
    • one arm is much better than the other, but the rewards are very noisy
    • it takes a really long time to disambiguate
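
To make the difference concrete, the sketch below estimates how often the truly better arm also has the higher sample mean after a given number of pulls per arm. The gap, noise level, and pull counts are illustrative values chosen here, not taken from the lecture, and prob_correct_ranking is a hypothetical helper.

```python
import numpy as np

def prob_correct_ranking(gap, noise_std, n_pulls, n_trials=5000, seed=0):
    """Fraction of trials in which the better arm has the higher sample mean."""
    rng = np.random.default_rng(seed)
    # Better arm has mean `gap`, worse arm has mean 0; both share the same noise level.
    good = rng.normal(gap, noise_std, size=(n_trials, n_pulls)).mean(axis=1)
    bad = rng.normal(0.0, noise_std, size=(n_trials, n_pulls)).mean(axis=1)
    return float(np.mean(good > bad))

if __name__ == "__main__":
    # Easy problem: no noise, a single pull of each arm settles it.
    print("no noise, 1 pull:", prob_correct_ranking(gap=1.0, noise_std=0.0, n_pulls=1))
    # Hard problem: the noise is large compared to the gap, so many pulls are needed.
    for n in (1, 10, 100):
        print(f"noisy, {n} pulls:", prob_correct_ranking(gap=0.2, noise_std=1.0, n_pulls=n))
```

The noise in a sample mean shrinks only like 1/sqrt(n), which is why a small gap buried in large noise needs many pulls before the two arms can be told apart.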

!Pasted image 20241025085759.png

Which action should we pick?

  • greedy would pick the green one
  • eps-greedy too (except on its random exploration steps)
  • optimism in the face of uncertainty says:
    • the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
    • principle: do not take the arm you believe is best, take the one which has the most potential to be the best

!Pasted image 20241025090344.png

The brackets represent a confidence interval around q*(a). The system is confident that the value lies somewhere in the region. The problem is that, when a region is large, we don't know where the average value is! So we have to try!

If region is very small, we are very certain!

!Pasted image 20241025090549.png

In this situation we choose Q2, as its estimated value is the highest.

!Pasted image 20241031144640.png

But in this case we choose Q1.

Action selection

!Pasted image 20241025090625.png

  • We will select the action that has the highest estimated value plus the upper-confidence bound exploration term

  • The c parameter is a user-specified parameter that controls the amount of exploration

  • N_{t}(a) is the number of times action a has been taken so far

  • to systematically reduce uncertainty, UCB explores more at the beginning

  • UCB's exploration reduces over time, while eps-greedy (with eps = 0.1) continues to take a random action 10% of the time
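
A minimal sketch of the selection rule, assuming the standard UCB form A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_{t}(a)) ], which is presumably what the figure above shows. The function name ucb_select, the default c = 2, and the choice to try untried arms first are assumptions made for this example.

```python
import math
import numpy as np

def ucb_select(Q, N, t, c=2.0):
    """Return the arm maximizing Q[a] + c * sqrt(ln(t) / N[a]).

    Arms that were never pulled (N[a] == 0) are tried first, since their
    exploration term would otherwise be undefined (assumed convention).
    """
    Q = np.asarray(Q, dtype=float)
    N = np.asarray(N, dtype=float)
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(math.log(t) / N)   # large when an arm has few pulls
    return int(np.argmax(Q + bonus))

# Arm 0 has the higher estimate but has been pulled 100 times;
# arm 1 looks slightly worse but has been pulled only twice.
choice = ucb_select(Q=[1.0, 0.8], N=[100, 2], t=102)
```

Here the exploration term for arm 1 (about 3.0) is much larger than for arm 0 (about 0.4), so arm 1 is selected even though its estimate is lower. With c = 0 the rule reduces to plain greedy selection, and as N_{t}(a) grows the bonus shrinks, which is why exploration reduces over time.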

AI generated summary

In the end, we can say that UCB is an effective strategy for balancing exploration and exploitation in multi-armed bandit problems. Unlike ε-greedy methods, which maintain a fixed level of exploration throughout the process, UCB dynamically adjusts its exploration rate based on the uncertainty associated with each action's value estimates. This adaptability makes UCB particularly well-suited for scenarios where initial exploration is crucial to quickly identify high-reward actions but later iterations require more focused exploitation.