master-degree-notes/Autonomous Networking/notes/8.md

The 10-armed testbed

  • we compare different strategies to assess their relative effectiveness

  • 10 actions along the X axis

  • Y axis shows the distribution of rewards

  • Each reward is sampled from a normal distribution with some mean q*(a) and variance=1

  • Each q*(a) is drawn from a normal distribution with mean=0 and variance=1

  • q* is randomly sampled from a normal distribution

  • rewards are randomly sampled based on q*(a)

  • on exploration steps, actions are taken uniformly at random

  • to fairly compare different methods we need to perform many independent runs

  • for each learning method we measure its performance averaged over 2000 independent runs (a code sketch of the testbed follows below)

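A minimal sketch of the testbed described above, using numpy; the class and variable names (`Bandit`, `q_star`, `n_arms`) are illustrative, not from the course code:

```python
import numpy as np

rng = np.random.default_rng(0)

class Bandit:
    """10-armed testbed: each true value q*(a) ~ N(0, 1),
    each reward ~ N(q*(a), 1)."""
    def __init__(self, n_arms=10):
        self.q_star = rng.normal(0.0, 1.0, size=n_arms)  # true action values

    def pull(self, a):
        return rng.normal(self.q_star[a], 1.0)           # noisy reward for action a
```

Averaging the reward curves over 2000 independently sampled `Bandit` instances smooths out the randomness of any single run, which is what the comparisons below rely on.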

Experiments

  • run experiments for different values of epsilon:
    • 0
    • 0.01
    • 0.1

  • exploring more, the agent finds the best actions sooner
  • exploring less, it converges more slowly
  • not exploring at all, it may never find the best action(s)
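A sketch of the eps-greedy loop with incremental sample-average updates, reusing the `Bandit` class sketched above; the function and variable names are my own, not from the course code:

```python
import numpy as np

def run_eps_greedy(bandit, eps, steps=1000):
    rng = np.random.default_rng()
    n_arms = len(bandit.q_star)
    Q = np.zeros(n_arms)   # estimated action values
    N = np.zeros(n_arms)   # how many times each action was taken
    rewards = []
    for _ in range(steps):
        if rng.random() < eps:            # exploration step: random action
            a = int(rng.integers(n_arms))
        else:                             # exploitation step: greedy action
            a = int(np.argmax(Q))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]         # incremental sample-average update
        rewards.append(r)
    return rewards
```

Running this for eps in {0, 0.01, 0.1} over many bandit instances and averaging the reward curves reproduces the comparison summarized in the bullets above.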

Let's do the same experiment starting with optimistic initial values

  • we start with a deliberately high initial estimate of the action values
  • we set q1(a) = +5 for all actions
  • as the resulting plot shows, the system explores more at the beginning, which is good: it finds the best actions to take sooner

Optimistic initial value method:

  • explores more at the beginning
  • only effective for stationary problems
    • for non-stationary problems we have to use eps-greedy
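The only change with respect to the eps-greedy sketch above is the initialization and the purely greedy selection; a minimal sketch under the same assumptions:

```python
import numpy as np

def run_optimistic_greedy(bandit, q_init=5.0, steps=1000):
    n_arms = len(bandit.q_star)
    Q = np.full(n_arms, q_init)   # optimistic initial estimates, q1(a) = +5
    N = np.zeros(n_arms)
    rewards = []
    for _ in range(steps):
        a = int(np.argmax(Q))     # purely greedy: optimism alone drives exploration
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a] # each disappointing reward lowers Q[a], so other arms get tried
        rewards.append(r)
    return rewards
```

Because the sample-average update eventually washes out the initial values, this trick only helps early on, which is why it is mainly useful for stationary problems.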

Optimism in the Face of Uncertainty

  • ...
  • easy problem:
    • two arms, one always good and one always bad
    • try both and done
  • hard problem:
    • one arm is much better than the other, but there is a lot of noise
    • it takes a really long time to disambiguate them

In the figure from the slides: which action should we pick?

  • greedy would pick the green one (the action with the highest estimated value)
  • eps-greedy would too, most of the time
  • optimism in the face of uncertainty says:
    • the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
    • principle: do not take the arm you believe is best, take the one which has the most potential to be the best

In the figure, the brackets represent a confidence interval around q*(a): the system is confident that the true value lies somewhere in that region.

If the region is very small, we are very certain.

In this situation we choose Q2, as its estimated value is the highest.
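For intuition (an addition, not from the slides): if Q_t(a) is the sample average of N_t(a) observed rewards, a typical confidence interval around it has width proportional to 1/sqrt(N_t(a)), so the bracket around an action shrinks the more often that action is tried, while rarely tried actions keep wide brackets and remain candidates for exploration.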

Action selection

... check slides for the formula explanation ...
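For reference, the UCB action-selection rule usually presented here (as in Sutton & Barto) is

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where Q_t(a) is the current estimate, N_t(a) the number of times action a has been taken, t the current time step, and c > 0 controls the amount of exploration.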

  • to systematically reduce uncertainty, UCB explores more at the beginning
  • UCB's exploration reduces over time, while eps-greedy continues to take a random action 10% of the time (with eps = 0.1)
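A minimal sketch of UCB action selection under the definitions above; the function and parameter names are illustrative:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick the action with the largest upper confidence bound.

    Q: current action-value estimates, N: pull counts per action,
    t: current time step (1-based), c: exploration strength."""
    if np.any(N == 0):
        return int(np.argmin(N))          # try every action at least once
    bonus = c * np.sqrt(np.log(t) / N)    # uncertainty bonus, shrinks as N[a] grows
    return int(np.argmax(Q + bonus))
```

Because the bonus term shrinks as an action is tried more often, exploration tapers off on its own, which matches the last bullet above.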