### The 10-armed testbed
- we compare different strategies to assess their relative effectiveness
- 10 actions along the X axis
- Y axis shows the distribution of rewards
- each reward is sampled from a normal distribution with mean q*(a) and variance = 1
- each q*(a) is drawn from a normal distribution with mean = 0 and variance = 1

![[Pasted image 20241025084609.png]]

- q* is randomly sampled from a normal distribution
- rewards are randomly sampled based on q*
- actions are taken at random on exploration steps
- with this much randomness, a fair comparison of methods requires many independent runs
- for any learning method we measure its performance averaged over 2000 independent runs

![[Pasted image 20241025084755.png]]

.. add slide ...

![[Pasted image 20241025084830.png]]

#### Experiments
- run experiments for different epsilons
    - 0
    - 0.01
    - 0.1

![[Pasted image 20241025084938.png]]

- exploring more (eps = 0.1) finds the best actions sooner
- exploring less (eps = 0.01) converges more slowly
- not exploring (eps = 0) may never find the best action(s)

Let's do the same experiment starting with optimistic initial values
- we start with a high initial estimate of the value of every action
- we set the initial estimate q1(a) = +5 for all actions

![[Pasted image 20241025085237.png]]

As we can see, the system explores more at the beginning, which is good as it will find the best actions to take sooner! Since the true values are drawn around 0, +5 is far too optimistic: every reward is lower than expected, so the agent keeps switching to actions it has not tried much yet.

**Optimistic initial value method:**
- explores more at the beginning
- only effective for stationary problems (the push to explore is temporary, so it cannot react to later changes)
- for non-stationary problems we need a method that keeps exploring, such as eps-greedy

### Optimism in the Face of Uncertainty
- ...
- easy problem:
    - two arms, one always good and one always bad
    - try both and done
- hard problem:
    - one arm is much better than the other, but the rewards are very noisy
    - it takes a really long time to disambiguate

![[Pasted image 20241025085759.png]]

Which action should we pick?
- greedy would pick the green one
- eps-greedy would too (most of the time)
- optimism in the face of uncertainty says:
    - the more uncertain we are about an action-value, the more important it is to explore that action, as it could turn out to be the best!
    - principle: *do not take the arm you believe is best, take the one which has the most potential to be the best*

![[Pasted image 20241025090344.png]]

The brackets represent a confidence interval around q*(a): the system is confident that the true value lies somewhere in that region. If the region is very small, we are very certain!

![[Pasted image 20241025090549.png]]

In this situation we choose Q2, as its upper confidence bound (the top of its bracket) is the highest.

#### Action selection
![[Pasted image 20241025090625.png]]

... check slides for formula explanation ...

- to systematically reduce uncertainty, UCB explores more at the beginning
- UCB's exploration reduces over time, whereas eps-greedy (with eps = 0.1) keeps taking a random action 10% of the time
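
The comparison between eps-greedy and UCB can be tried out with a small sketch of the 10-armed testbed. The formula from the slides is not reproduced in these notes, so this assumes the usual UCB rule, $A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\ln t / N_t(a)} \right]$, with `c = 2.0` as an assumed setting. All function names (`make_testbed`, `select_ucb`, `run`, ...) are made up for illustration, and value estimates use plain sample averages.

```python
import math
import random


def make_testbed(k, rng):
    """Draw the true action values q*(a) ~ N(0, 1) for k arms."""
    return [rng.gauss(0.0, 1.0) for _ in range(k)]


def argmax_random_tie(values, rng):
    """Index of the largest value, breaking ties at random."""
    best = max(values)
    return rng.choice([i for i, v in enumerate(values) if v == best])


def select_eps_greedy(q_est, eps, rng):
    """With probability eps take a random action, otherwise the greedy one."""
    if rng.random() < eps:
        return rng.randrange(len(q_est))
    return argmax_random_tie(q_est, rng)


def select_ucb(q_est, counts, t, c, rng):
    """Pick the arm with the highest upper confidence bound (assumed rule)."""
    untried = [a for a, n in enumerate(counts) if n == 0]
    if untried:  # an arm never tried has unbounded uncertainty: try it first
        return rng.choice(untried)
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_est, counts)]
    return argmax_random_tie(scores, rng)


def run(method, steps=1000, k=10, eps=0.1, c=2.0, seed=0):
    """One independent run; returns the fraction of steps on the best arm."""
    rng = random.Random(seed)
    q_true = make_testbed(k, rng)
    best_arm = max(range(k), key=lambda a: q_true[a])
    q_est = [0.0] * k   # estimated action values
    counts = [0] * k    # number of times each arm was taken
    optimal = 0
    for t in range(1, steps + 1):
        if method == "ucb":
            a = select_ucb(q_est, counts, t, c, rng)
        else:
            a = select_eps_greedy(q_est, eps, rng)
        reward = rng.gauss(q_true[a], 1.0)           # reward ~ N(q*(a), 1)
        counts[a] += 1
        q_est[a] += (reward - q_est[a]) / counts[a]  # sample-average update
        optimal += (a == best_arm)
    return optimal / steps


def average_optimal(method, runs=200, **kwargs):
    """Average the optimal-action fraction over several independent runs."""
    return sum(run(method, seed=s, **kwargs) for s in range(runs)) / runs
```

For example, `average_optimal("ucb", runs=200)` and `average_optimal("eps", runs=200, eps=0.1)` give two numbers to compare. The notes use 2000 runs; in pure Python that takes a while, so the default here is smaller.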