### The 10-armed testbed
- we compare different strategies to assess their relative effectiveness
- 10 actions along the X axis
- the Y axis shows the distribution of rewards
- each reward is sampled from a normal distribution with mean q*(a) and variance 1
- each q*(a) is drawn from a normal distribution with mean 0 and variance 1

![[Pasted image 20241025084609.png]]
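A minimal sketch of how such a testbed can be generated (NumPy; the names `true_values` and `sample_reward` are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10                                  # number of actions (arms)
true_values = rng.normal(0.0, 1.0, k)   # q*(a) ~ N(0, 1), one per action

def sample_reward(action):
    # each reward is a noisy sample around the true action value: N(q*(a), 1)
    return rng.normal(true_values[action], 1.0)
```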
- q* is randomly sampled from a normal distribution
- rewards are randomly sampled based on q*
- actions are taken at random on exploration steps
- to fairly compare different methods we need to perform many independent runs
- for any learning method we measure its performance averaged over 2000 independent runs

![[Pasted image 20241025084755.png]]
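A sketch of this evaluation protocol (assuming an `agent_factory` that builds a fresh agent exposing `select_action`/`update`; these names, and the 1000 steps per run, are only illustrative):

```python
import numpy as np

def evaluate(agent_factory, runs=2000, steps=1000, k=10, seed=0):
    # average reward per time step, over many independent bandit problems
    rng = np.random.default_rng(seed)
    avg_reward = np.zeros(steps)
    for _ in range(runs):
        true_values = rng.normal(0.0, 1.0, k)    # fresh q*(a) for this run
        agent = agent_factory()
        for t in range(steps):
            a = agent.select_action()
            r = rng.normal(true_values[a], 1.0)  # reward ~ N(q*(a), 1)
            agent.update(a, r)
            avg_reward[t] += r
    return avg_reward / runs
```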
... add slides ...

![[Pasted image 20241025084830.png]]
#### Experiments
- run experiments for different values of epsilon:
    - 0
    - 0.01
    - 0.1

![[Pasted image 20241025084938.png]]
- exploring more (eps = 0.1), the agent finds the best actions sooner
- exploring less (eps = 0.01), it converges more slowly
- not exploring at all (eps = 0), it may never find the best action(s)
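A minimal eps-greedy agent with incremental sample-average estimates (a sketch; the class name and interface are illustrative and match the `evaluate` sketch above):

```python
import numpy as np

class EpsilonGreedyAgent:
    def __init__(self, k=10, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.Q = np.zeros(k)   # estimated action values
        self.N = np.zeros(k)   # how many times each action was taken

    def select_action(self):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.Q)))  # explore: random action
        return int(np.argmax(self.Q))                   # exploit: greedy action

    def update(self, action, reward):
        # incremental sample-average update: Q <- Q + (r - Q) / N
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```

For example, `evaluate(lambda: EpsilonGreedyAgent(epsilon=0.1))` gives one average-reward curve like those in the figure above.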
Let's do the same experiment starting with optimistic initial values
- we start with deliberately high initial estimates of the rewards
- we set Q1(a) = +5 for all actions

![[Pasted image 20241025085237.png]]

As we can see, the system explores more at the beginning, which is good: it finds the best actions to take sooner!
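A sketch of a purely greedy agent with optimistic initial values (assuming a constant step size of 0.1, as in the usual textbook version of this experiment; the class name is illustrative):

```python
import numpy as np

class OptimisticGreedyAgent:
    def __init__(self, k=10, initial_q=5.0, step_size=0.1):
        self.alpha = step_size
        self.Q = np.full(k, initial_q)   # optimistic initial estimates: Q1(a) = +5

    def select_action(self):
        # purely greedy: the optimism itself drives the early exploration
        return int(np.argmax(self.Q))

    def update(self, action, reward):
        # constant step-size update pulls the inflated estimate down toward reality
        self.Q[action] += self.alpha * (reward - self.Q[action])
```

Because every observed reward is far below +5, each action that gets tried has its estimate pulled down, so the greedy agent keeps switching to the still-inflated untried actions early on.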
**Optimistic initial value method:**

- explores more at the beginning
- only effective for stationary problems
- for non-stationary problems we have to use eps-greedy
### Optimism in the Face of Uncertainty
- ...
- easy problem:
    - two arms, one always good and one always bad
    - try both and you're done
- hard problem:
    - one arm is much better than the other, but there is a lot of noise
    - it takes a really long time to disambiguate them (see the quick check below)
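A quick check of the hard case, with made-up numbers (true values 1.0 vs 0.9, reward variance 1):

```python
import numpy as np

rng = np.random.default_rng(0)
gap, sigma, trials = 0.1, 1.0, 1000    # hypothetical: value gap 0.1, noise std 1

for pulls in (10, 100, 1000):
    # pull each arm `pulls` times and compare the sample means
    good = rng.normal(1.0, sigma, (trials, pulls)).mean(axis=1)
    bad = rng.normal(1.0 - gap, sigma, (trials, pulls)).mean(axis=1)
    # fraction of trials in which the worse arm still looks better
    print(pulls, (bad > good).mean())
```

Even after 100 pulls of each arm, the worse arm still looks better in a sizable fraction of trials; only around 1000 pulls per arm does the ordering become reliable.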
![[Pasted image 20241025085759.png]]
Which action should we pick?

- greedy would pick the green one
- eps-greedy would too (most of the time)
- optimism in the face of uncertainty says:
    - the more uncertain we are about an action value, the more important it is to explore that action, as it could turn out to be the best!
    - principle: *do not take the arm you believe is best, take the one which has the most potential to be the best*
![[Pasted image 20241025090344.png]]
The brackets represent a confidence interval around q*(a): the system is confident that the true value lies somewhere in that region.

If the region is very small, we are very certain!

![[Pasted image 20241025090549.png]]

In this situation we choose Q2, as its estimated value is the highest.
#### Action selection
![[Pasted image 20241025090625.png]]

... check slides for formula explanation ...
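For reference, the formula in the slide is presumably the standard UCB action-selection rule (as in Sutton & Barto):

$$A_t = \arg\max_a \left[\, Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$

where Q_t(a) is the current value estimate, N_t(a) is the number of times action a has been taken so far, t is the current time step, and c > 0 controls the width of the confidence term (the "brackets" in the figures above).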
- to systematically reduce uncertainty, UCB explores more at the beginning
- UCB's exploration reduces over time, while eps-greedy continues to take a random action 10% of the time (for eps = 0.1)
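A minimal UCB agent implementing the rule above (a sketch; c = 2 and the interface are illustrative, matching the earlier agents):

```python
import numpy as np

class UCBAgent:
    def __init__(self, k=10, c=2.0):
        self.c = c              # exploration strength
        self.Q = np.zeros(k)    # estimated action values
        self.N = np.zeros(k)    # action counts
        self.t = 0              # total time steps so far

    def select_action(self):
        self.t += 1
        # an untried action has an unbounded confidence interval: try it first
        if np.any(self.N == 0):
            return int(np.argmin(self.N))
        upper_bounds = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
        return int(np.argmax(upper_bounds))

    def update(self, action, reward):
        # incremental sample-average update
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
```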