
The 10-armed testbed

  • we compare different strategies to assess their relative effectiveness

  • 10 actions along the X axis

  • Y axis shows the distribution of rewards

  • Each reward is sampled from a normal distribution with some mean q*(a) and variance=1

  • Each q*(a) is drawn from a normal distribution with mean=0 and variance=1 ![[Pasted image 20241025084609.png]]

  • q* is randomly sampled from a normal distribution

  • rewards are randomly sampled based on q

  • actions are randomly taken on exploration steps

  • to fairly compare different methods we need to perform many independent runs

  • for any learning method we measure its performance over 2000 independent runs (a minimal sketch of this testbed is given below)

![[Pasted image 20241025084755.png]]
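A minimal sketch of the testbed, assuming NumPy (the function names and the 1000-step run length are illustrative choices, not from the slides):

```python
import numpy as np

def make_bandit(k=10, rng=None):
    """Create one k-armed bandit: each q*(a) is drawn from N(0, 1)."""
    rng = rng or np.random.default_rng()
    return rng.normal(loc=0.0, scale=1.0, size=k)

def pull(q_star, action, rng):
    """The reward for an action is sampled from N(q*(a), 1)."""
    return rng.normal(loc=q_star[action], scale=1.0)

# Fair comparison: average each method's performance over 2000 independent runs,
# each on a freshly sampled bandit (e.g. 1000 steps per run).
rng = np.random.default_rng(0)
n_runs, n_steps, k = 2000, 1000, 10
for run in range(n_runs):
    q_star = make_bandit(k, rng)
    # ... run the learning method being evaluated for n_steps on this bandit ...
```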

... add slides ... ![[Pasted image 20241025084830.png]]

Experiments

  • run experiments for different epsilons:
    • eps = 0
    • eps = 0.01
    • eps = 0.1

![[Pasted image 20241025084938.png]]

  • exploring more (eps = 0.1) finds the best actions sooner
  • exploring less (eps = 0.01) makes convergence slower
  • not exploring at all (eps = 0) may never find the best action(s); see the eps-greedy sketch below
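
A minimal eps-greedy agent for this testbed, using incremental sample-average updates (a sketch under the same assumptions as above; the slides may formulate it differently):

```python
import numpy as np

def eps_greedy_run(q_star, eps, n_steps, rng):
    """Run eps-greedy on one bandit; returns the reward obtained at each step."""
    k = len(q_star)
    Q = np.zeros(k)                      # action-value estimates
    N = np.zeros(k)                      # how many times each action was taken
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        if rng.random() < eps:
            a = int(rng.integers(k))     # exploration step: random action
        else:
            a = int(np.argmax(Q))        # greedy step: best estimate so far
        r = rng.normal(q_star[a], 1.0)   # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # incremental sample-average update
        rewards[t] = r
    return rewards
```

Running this with eps in {0, 0.01, 0.1} and averaging the reward curves over the 2000 independent runs should reproduce the comparison shown in the plot above.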

Let's do the same experiment starting with optimistic initial values

  • we start with high initial estimates of the rewards
  • we set q1(a) = +5 for all actions

![[Pasted image 20241025085237.png]]

As we can see, the system explores more at the beginning, which is good: it finds the best actions to take sooner (see the code sketch below).

Optimistic initial value method:

  • explores more at the beginning
  • only effective for stationary problems
    • for non-stationary problems we have to use eps-greedy
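
A minimal sketch of the optimistic-initial-values variant (the constant step size alpha = 0.1 is an assumption taken from the usual textbook treatment; the slides may use a different update):

```python
import numpy as np

def optimistic_greedy_run(q_star, n_steps, rng, q1=5.0, alpha=0.1):
    """Purely greedy agent whose estimates start at q1(a) = +5."""
    k = len(q_star)
    Q = np.full(k, q1)                   # optimistic initial estimates
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        a = int(np.argmax(Q))            # always greedy, no random exploration
        r = rng.normal(q_star[a], 1.0)   # reward ~ N(q*(a), 1)
        Q[a] += alpha * (r - Q[a])       # constant-step-size update
        rewards[t] = r
    return rewards
```

Because early rewards are almost always far below +5, every arm looks disappointing after being tried, so the greedy agent keeps switching arms: exploration happens without any randomness.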

Optimism in the Face of Uncertainty

  • ...
  • easy problem:
    • two arms, one always good and one always bad
    • try both once and you are done
  • hard problem:
    • one arm is much better than the other, but the rewards are very noisy
    • it takes a long time to disambiguate them

![[Pasted image 20241025085759.png]]

Which action should we pick?

  • greedy would pick the green one
  • eps-greedy would too, most of the time
  • optimism in the face of uncertainty says:
    • the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
    • principle: do not take the arm you believe is best, take the one which has the most potential to be the best

![[Pasted image 20241025090344.png]]

The brackets represent a confidence interval around q*(a): the system is confident that the value lies somewhere in that region.

If the region is very small, we are very certain!

![[Pasted image 20241025090549.png]]

In this situation we choose Q2, as its estimated value is the highest.

Action selection

![[Pasted image 20241025090625.png]]

... check slides for the formula explanation ...
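
For reference, the standard UCB action-selection rule (as in Sutton & Barto) is

$$A_t = \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where Q_t(a) is the current value estimate, N_t(a) is the number of times action a has been selected so far, and c > 0 controls the amount of exploration; the square-root term is the uncertainty bonus.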

  • to systematically reduce uncertainty, UCB explores more at the beginning
  • UCB's exploration reduces over time, while eps-greedy keeps taking a random action 10% of the time (with eps = 0.1); see the sketch below
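
A minimal UCB sketch for the same testbed (the exploration weight c = 2 is an assumed, commonly used value; the slides may use another):

```python
import numpy as np

def ucb_run(q_star, n_steps, rng, c=2.0):
    """UCB action selection on one bandit; returns the reward at each step."""
    k = len(q_star)
    Q = np.zeros(k)                      # value estimates
    N = np.zeros(k)                      # selection counts
    rewards = np.zeros(n_steps)
    for t in range(n_steps):
        if np.any(N == 0):
            a = int(np.argmin(N))        # try every arm at least once
        else:
            bonus = c * np.sqrt(np.log(t + 1) / N)   # uncertainty bonus
            a = int(np.argmax(Q + bonus))
        r = rng.normal(q_star[a], 1.0)   # reward ~ N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]        # sample-average update
        rewards[t] = r
    return rewards
```

As N(a) grows, the bonus term shrinks, which is exactly why UCB's exploration tapers off over time while eps-greedy keeps exploring at a constant rate.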