The 10-armed testbed
- We compare different strategies to assess their relative effectiveness.
- The 10 actions lie along the x axis.
- The y axis shows the distribution of rewards.
- Each reward is sampled from a normal distribution with mean q*(a) and variance 1.
- Each q*(a) is drawn from a normal distribution with mean 0 and variance 1.
- q* is randomly sampled from a normal distribution.
- Rewards are randomly sampled based on q*.
- Actions are taken at random on exploration steps.
- To fairly compare different methods we need to perform many independent runs.
- For any learning method we measure its performance over 2000 independent runs.
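A minimal sketch of this testbed in Python (assuming NumPy; the names NUM_ARMS, NUM_RUNS, make_bandit and pull are my own, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_ARMS = 10    # the 10 arms / actions
NUM_RUNS = 2000  # independent runs used to average performance

def make_bandit():
    """One bandit problem: true values q*(a) ~ N(0, 1)."""
    return rng.normal(loc=0.0, scale=1.0, size=NUM_ARMS)

def pull(q_star, action):
    """Reward for an action: sampled from N(q*(a), 1)."""
    return rng.normal(loc=q_star[action], scale=1.0)

# Each run uses a freshly sampled bandit problem; results are
# averaged over the 2000 runs to compare methods fairly.
bandits = [make_bandit() for _ in range(NUM_RUNS)]
```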
Experiments
- run experiments for different epsilons
- 0
- 0.01
- 0.1
- exploring more, we find the best actions sooner
- exploring less, convergence is slower
- not exploring at all, we may never find the best action(s) (see the eps-greedy sketch below)
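A minimal sketch of eps-greedy with sample-average value estimates, reusing make_bandit and pull from the sketch above (the 1000-step horizon is an assumption, not from the notes):

```python
def eps_greedy_run(q_star, eps, steps=1000):
    """One eps-greedy run with sample-average value estimates (a sketch)."""
    k = len(q_star)
    Q = np.zeros(k)   # estimated action values
    N = np.zeros(k)   # how many times each action was taken
    rewards = np.zeros(steps)
    for t in range(steps):
        if rng.random() < eps:
            a = int(rng.integers(k))   # explore: random action
        else:
            a = int(np.argmax(Q))      # exploit: greedy action
        r = pull(q_star, a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]      # incremental sample-average update
        rewards[t] = r
    return rewards

# Compare eps in {0, 0.01, 0.1} by averaging reward curves over the 2000 runs.
avg_rewards = {
    eps: np.mean([eps_greedy_run(b, eps) for b in bandits], axis=0)
    for eps in (0.0, 0.01, 0.1)
}
```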
Let's do the same experiment starting with optimistic initial values
- we start with a high initial estimate of the action values
- we set Q1(a) = +5 for all actions
As we can see, the system explores more at the beginning, which is good: it will find the best actions to take sooner!
Optimistic initial value method:
- explores more at the beginning
- only effective for stationary problems
- for non-stationary problems we have to use eps-greedy
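A minimal sketch of the optimistic-initial-values variant: purely greedy selection starting from Q1(a) = +5, reusing pull from the testbed sketch above (the constant step size alpha = 0.1 is an assumption, not from the notes):

```python
def optimistic_greedy_run(q_star, steps=1000, q_init=5.0, alpha=0.1):
    """Greedy selection starting from optimistic estimates Q1(a) = +5 (a sketch)."""
    k = len(q_star)
    Q = np.full(k, q_init)   # optimistic start drives early exploration
    rewards = np.zeros(steps)
    for t in range(steps):
        a = int(np.argmax(Q))        # purely greedy selection
        r = pull(q_star, a)
        Q[a] += alpha * (r - Q[a])   # constant step-size update
        rewards[t] = r
    return rewards
```

Because every estimate starts far above any realistic reward, each pulled arm is "disappointing", so the greedy rule keeps switching arms early on and only settles once the estimates come down.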
Optimism in the Face of Uncertainty - Upper Confidence Bound (UCB)
- ...
- easy problem:
- two arms, one always good and one always bad
- try both and done
- hard problem:
- one arm is much better than the other, but there is a lot of noise
- it takes a really long time to disambiguate them
Which action should we pick?
- greedy would pick the green one
- eps-greedy would too
- optimism in the face of uncertainty says:
- the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
- principle: do not take the arm you believe is best, take the one which has the most potential to be the best
the brackets represent a confidence interval around q*(a). The system is confident that the value lies somewhere in the region.
The problem is that, when a region is large, we don't know where the average value is! So we have to try!
If the region is very small, we are very certain!
In this situation we choose Q2, as its estimated value is the highest.
But in this case we choose Q1.
Action selection
- We select the action that has the highest estimated value plus the upper-confidence-bound exploration term (see the sketch below).
- The parameter c is a user-specified parameter that controls the amount of exploration.
- N_t(a) is the number of times action a has been taken.
- To systematically reduce uncertainty, UCB explores more at the beginning.
- UCB's exploration reduces over time, while eps-greedy continues to take a random action 10% of the time (with eps = 0.1).
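A minimal sketch of UCB selection, A_t = argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ], reusing pull from the testbed sketch above (c = 2 and trying each untried action first are my assumptions, not from the notes):

```python
def ucb_run(q_star, c=2.0, steps=1000):
    """UCB action selection with sample-average value estimates (a sketch)."""
    k = len(q_star)
    Q = np.zeros(k)
    N = np.zeros(k)
    rewards = np.zeros(steps)
    for t in range(1, steps + 1):
        untried = np.flatnonzero(N == 0)
        if untried.size > 0:
            a = int(untried[0])      # untried actions are maximally uncertain
        else:
            # estimated value plus the upper-confidence-bound exploration term
            a = int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
        r = pull(q_star, a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]    # incremental sample-average update
        rewards[t - 1] = r
    return rewards
```

As t grows, the exploration bonus of a frequently taken action shrinks, so UCB's exploration dies out over time instead of staying fixed like eps-greedy's.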
AI generated summary
In the end, we can say that UCB is an effective strategy for balancing exploration and exploitation in multi-armed bandit problems. Unlike ε-greedy methods, which maintain a fixed level of exploration throughout the process, UCB dynamically adjusts its exploration rate based on the uncertainty associated with each action's value estimates. This adaptability makes UCB particularly well-suited for scenarios where initial exploration is crucial to quickly identify high-reward actions but later iterations require more focused exploitation.