Marco Realacci 2024-11-02 16:28:37 +01:00
parent be7844b4f3
commit eea09ec9b8
15 changed files with 35749 additions and 63 deletions

New image file added (12 KiB); binary diff not shown.


@ -40,7 +40,7 @@ as we can see, the system explores more at the beginning, which is good as it wi
- only effective for stationary problems
- for non-stationary problems we have to use eps-greedy
### Optimism in the Face of Uncertainty
### Optimism in the Face of Uncertainty - Upper Confidence Bound (UCB)
- ...
- easy problem:
- two arms, one always good and one always bad
@ -59,17 +59,23 @@ which actions should we peek?
![[Pasted image 20241025090344.png]]
The brackets represent a confidence interval around q*(a): the system is confident that the true value lies somewhere in that region.
The problem is that when a region is large, we don't know where the true value actually is, so we have to try that action!
If the region is very small, we are very certain about its value.
![[Pasted image 20241025090549.png]]
In this situation we choose Q2, as its estimated value is the highest.
![[Pasted image 20241031144640.png]]
But in this case we choose Q1, because its upper confidence bound is higher.
#### Action selection
![[Pasted image 20241025090625.png]]
... check slides for the formula explanation ...
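Written out, the rule on the slide is presumably the standard UCB action-selection rule; this reconstruction is an assumption, but it matches the bullet points below:
$$A_t = \arg\max_{a} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$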
- We select the action with the highest estimated value plus the upper-confidence-bound exploration term
- The $c$ parameter is user-specified and controls the amount of exploration
- $N_{t}(a)$ is the number of times action $a$ has been taken
- to systematically reduce uncertainty, UCB explores more at the beginning
- UCB's exploration reduces over time, while eps-greedy keeps taking a random action 10% of the time (for eps = 0.1); see the Python sketch below
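A minimal sketch of UCB action selection for a k-armed bandit, assuming sample-average estimates `Q` and per-action counts `N` are tracked elsewhere; the function name and the default `c` are illustrative, not taken from the slides.

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Upper Confidence Bound action selection (sketch).

    Q : np.ndarray, estimated action values Q_t(a)
    N : np.ndarray, counts N_t(a) of how often each action was taken
    t : int, current time step (1-based)
    c : float, user-specified exploration parameter
    """
    # An action that was never tried has unbounded uncertainty: try it first.
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    # Estimated value plus the upper-confidence exploration bonus.
    # The bonus shrinks as N_t(a) grows, so exploration fades over time.
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))
```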
### AI generated summary
In the end, we can say that UCB is an effective strategy for balancing exploration and exploitation in multi-armed bandit problems. Unlike ε-greedy methods, which maintain a fixed level of exploration throughout the process, UCB dynamically adjusts its exploration rate based on the uncertainty associated with each action's value estimates. This adaptability makes UCB particularly well-suited for scenarios where initial exploration is crucial to quickly identify high-reward actions but later iterations require more focused exploitation.


@ -57,7 +57,7 @@ This is a Markov Process but we also have a reward function! We also have a disc
Value function
- The value function v(s) gives the long-term value of (being in) state s
- The state value function v(s) of an MRP is the expected return starting from state s $𝑉) = 𝔼 [𝐺𝑡 |𝑆𝑡 = 𝑠]$
- The state value function v(s) of an MRP is the expected return starting from state s $V(s) = \mathbb{E}[G_t \mid S_t = s]$
![[Pasted image 20241030103519.png]]
![[Pasted image 20241030103706.png]]
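For reference, a short sketch of the return behind the expectation above and the Bellman equation it implies for an MRP, in standard notation with $\gamma$ the discount factor mentioned earlier (not copied verbatim from the slides):
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
$$v(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]$$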
@ -111,8 +111,8 @@ The state-value function v𝜋(s) of an MDP is the expected return starting from
The action-value function $q_{\pi}(s,a)$ is the expected return starting from state s, taking action a, and then following policy $\pi$ $$q_{\pi}(s,a)= \mathbb{E}_{\pi}[G_t \mid S_t=s, A_t=a]$$
![[Pasted image 20241030105022.png]]
- The state-value function can again be decomposed into immediate reward plus discounted value of successor state $$v\pi(s) = E\pi[Rt+1 + v⇡(St+1) | St = s]$$
- The action-value function can similarly be decomposed $$q\pi(s, a) = E\pi [Rt+1 + q⇡(St+1, At+1) | St = s, At = a]$$
- The state-value function can again be decomposed into immediate reward plus discounted value of successor state $$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, v_{\pi}(S_{t+1}) \mid S_t = s]$$
- The action-value function can similarly be decomposed $$q_{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma\, q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
![[Pasted image 20241030105148.png]]![[Pasted image 20241030105207.png]]
![[Pasted image 20241030105216.png]]
Putting it all together:
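A minimal sketch of iterative policy evaluation, which repeatedly applies the Bellman expectation equations above to a small tabular MDP. The array shapes and names (`P`, `R`, `pi`) are assumptions for illustration, not from the slides.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Evaluate v_pi by iterating the Bellman expectation equation.

    P  : transition probabilities, shape (S, A, S), P[s, a, s2] = p(s2 | s, a)
    R  : expected immediate rewards, shape (S, A)
    pi : policy, shape (S, A), pi[s, a] = probability of taking a in s
    """
    v = np.zeros(P.shape[0])
    while True:
        # q_pi(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * v(s')
        q = R + gamma * (P @ v)          # shape (S, A)
        # v_pi(s) = sum_a pi(a | s) * q_pi(s, a)
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```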