- only effective for stationary problems
- for non-stationary problems we have to use eps-greedy
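These two bullets presumably refer to sample-average value estimates. A minimal sketch (function name and parameters are my own, not from the course) of eps-greedy with a constant step size, which can track a drifting $q_*(a)$ because recent rewards are weighted more than old ones:

```python
import random

def eps_greedy_constant_alpha(reward_fn, k, steps=1000, eps=0.1, alpha=0.1, seed=0):
    """eps-greedy with a constant step size alpha.

    Unlike sample averages (step size 1/N), a constant alpha
    down-weights old rewards geometrically, so the estimate can
    keep tracking a non-stationary q*(a).
    """
    rng = random.Random(seed)
    q = [0.0] * k
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(k)                    # explore: random arm
        else:
            a = max(range(k), key=lambda i: q[i])   # exploit: greedy arm
        r = reward_fn(a, rng)
        q[a] += alpha * (r - q[a])                  # constant-alpha update
    return q

# toy check: arm 1 always pays 1, arm 0 always pays 0
q = eps_greedy_constant_alpha(lambda a, rng: float(a), k=2)
print(q)  # q[1] ends up close to 1.0
```

With a sample average the weight of each reward is $1/N$, so old observations never fade; the constant-$\alpha$ update is what makes the estimate responsive to change.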

### Optimism in the Face of Uncertainty - Upper Confidence Bound (UCB)

- ...
- easy problem:
    - two arms, one always good and one always bad

which actions should we pick?

![[Pasted image 20241025090344.png]]

The brackets represent a confidence interval around $q_*(a)$: the system is confident that the true value lies somewhere in that region.

The problem is that when an interval is wide, we don't know where the true value is! So we have to try that action!

If the interval is very small, we are very certain!

![[Pasted image 20241025090549.png]]

In this situation we choose Q2, as its estimated value is the highest.

![[Pasted image 20241031144640.png]]

But in this case we choose Q1.

#### Action selection

![[Pasted image 20241025090625.png]]

... check slides for formula explanation ...

- We will select the action that has the highest estimated value plus the upper-confidence-bound exploration term
- The $c$ parameter is user-specified and controls the amount of exploration
- $N_{t}(a)$ is the number of times action $a$ has been taken so far
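The bullets above match the standard UCB selection rule (Sutton & Barto's formulation; assuming the slides use the same form):

$$A_t = \operatorname*{arg\,max}_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

The square-root term shrinks as $N_t(a)$ grows, so actions that have been tried often receive a smaller exploration bonus.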

- to systematically reduce uncertainty, UCB explores more at the beginning
- UCB's exploration decreases over time, while eps-greedy keeps taking a random action a fixed fraction of the time (e.g. 10% for $\epsilon = 0.1$)
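The comparison above can be sketched numerically. A minimal UCB implementation (a sketch assuming sample-average estimates and Gaussian rewards; the names are my own, not from the slides):

```python
import math
import random

def ucb_action(q, n, t, c=2.0):
    """Pick argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].

    An arm that was never tried is treated as maximally
    uncertain and is selected immediately.
    """
    best_a, best_score = 0, float("-inf")
    for a in range(len(q)):
        if n[a] == 0:
            return a
        score = q[a] + c * math.sqrt(math.log(t) / n[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a

def run_ucb(true_means, steps=2000, c=2.0, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    q, n = [0.0] * k, [0] * k       # value estimates and pull counts
    for t in range(1, steps + 1):
        a = ucb_action(q, n, t, c)
        r = rng.gauss(true_means[a], 1.0)  # noisy reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]          # incremental sample average
    return q, n

q, n = run_ucb([0.2, 1.0, 0.5])
print(n)  # the best arm (index 1) accumulates most of the pulls
```

Because the exploration bonus shrinks as $N_t(a)$ grows, the pull counts concentrate on the best arm over time, whereas eps-greedy would keep spending a fixed $\epsilon$ fraction of pulls on random arms forever.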

### AI generated summary

In the end, we can say that UCB is an effective strategy for balancing exploration and exploitation in multi-armed bandit problems. Unlike ε-greedy methods, which maintain a fixed level of exploration throughout the process, UCB dynamically adjusts its exploration rate based on the uncertainty associated with each action's value estimates. This adaptability makes UCB particularly well-suited for scenarios where initial exploration is crucial to quickly identify high-reward actions but later iterations require more focused exploitation.