### K-Armed bandit problem
- Rewards evaluate the actions taken
- evaluative feedback depends on the action taken
- no active exploration

Let's consider a simplified version of an RL problem: the K-armed bandit problem.

- K different options
- every time we need to choose one
- maximize the expected total reward over some time period
- analogy with slot machines
    - the levers are the actions
    - which lever gives the highest reward?
- **Formalization**
    - a set of actions A (or "arms")
    - a reward function R that follows an unknown probability distribution
    - only one state
    - at each step t, the agent selects an action $A_{t}$
    - the environment generates a reward $R_{t}$
    - goal: maximize the cumulative reward

Example: doctor treatment

- a doctor has 3 treatments (actions), each of them yielding a reward
- for the doctor to decide which action is best, we must define the value of taking each action
- we call these values the action values (or the action-value function)
- **action value:** $$q_{*}(a)=E[R_{t} \mid A_{t}=a]$$

Each action has a reward defined by a probability distribution:

- the red treatment follows a Bernoulli distribution
- the yellow treatment a binomial distribution
- the blue treatment a uniform distribution
- the agent does not know these distributions!

![[Pasted image 20241030165705.png]]
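
To make the "unknown distributions" concrete, here is a minimal sketch of such a bandit environment in Python. The notes only say red ~ Bernoulli, yellow ~ binomial, blue ~ uniform; the parameters, the `ARMS` table and the `bandit` helper below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward distributions for the three treatments
# (parameters are made up for illustration).
ARMS = {
    "red":    lambda: rng.binomial(n=1, p=0.4),        # Bernoulli(0.4)
    "yellow": lambda: rng.binomial(n=5, p=0.3),        # Binomial(5, 0.3)
    "blue":   lambda: rng.uniform(low=0.0, high=2.0),  # Uniform(0, 2)
}

def bandit(action: str) -> float:
    """Pull one arm and return a sampled reward; the agent never sees ARMS."""
    return float(ARMS[action]())

# Example pulls: the agent only observes these samples, not the distributions.
for a in ARMS:
    print(a, [round(bandit(a), 2) for _ in range(5)])
```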

- **the estimated action value Q** of action a is the sum of rewards observed when taking a, divided by the number of times a has been taken (a sample average): $$Q_{t}(a)=\frac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}$$
- **greedy action:**
    - doctors assign the treatment they currently think is the best
    - the greedy action is the action that currently has the largest estimated action value: $$A_{t}=\operatorname{argmax}_{a}\,Q_{t}(a)$$
    - greedy always exploits current knowledge
- **epsilon-greedy** (see the sketch after this list):
    - with probability epsilon we sometimes explore
    - with probability 1-eps: we choose the best (greedy) action
    - with probability eps: we choose a random action

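A minimal sketch of ε-greedy action selection over the estimated values, assuming the estimates are kept in a NumPy array `Q` (the `epsilon_greedy` function name and the array layout are my own, not from the notes):

```python
import numpy as np

def epsilon_greedy(Q: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: current greedy action

# usage: two actions with estimates Q, epsilon = 0.1
rng = np.random.default_rng(42)
Q = np.array([1.0, 1.5])
print(epsilon_greedy(Q, 0.1, rng))
```

Note that `np.argmax` breaks ties by taking the first maximal entry; a uniform tie-break is also common.
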
#### Exercise 1

In ε-greedy action selection, for the case of two actions and ε = 0.5, what is the probability that the greedy action is selected?

*Answer:* we have two actions. With probability 0.5 we exploit and select the greedy action directly.

But when exploration happens, the greedy action may still be selected!

So, with probability 0.5 we select the greedy action. With probability 0.5 we select a random action, which can be either of the two, so in the random case we select the greedy action with probability 0.5 * 0.5 = 0.25.

Finally, we select the greedy action with probability 0.5 + 0.25 = 0.75.

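A quick Monte Carlo check of this result (the variable names are mine; the snippet just simulates the selection rule many times and confirms 0.75 = (1 - ε) + ε / 2):

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, n_actions, greedy_action = 0.5, 2, 0

trials = 100_000
hits = 0
for _ in range(trials):
    if rng.random() < epsilon:
        action = rng.integers(n_actions)   # explore: uniform over both actions
    else:
        action = greedy_action             # exploit
    hits += int(action == greedy_action)

print(hits / trials)  # ≈ 0.75
```
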
#### Exercise 2

Consider a K-armed bandit problem:

- K = 4 actions, denoted 1, 2, 3 and 4
- the agent uses ε-greedy action selection
- the initial Q estimates are 0 for all actions: $$Q_{1}(a)=0$$
- the initial sequence of actions and rewards is:
    - A1 = 1, R1 = 1
    - A2 = 2, R2 = 1
    - A3 = 2, R3 = 2
    - A4 = 2, R4 = 2
    - A5 = 3, R5 = 0

On some of those time steps, the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur?

On which time steps could this possibly have occurred?

***Answer***

To answer, we need to compute the sequence of action-value estimates.

In the table, $Q_{a}$ denotes the estimated value of action a after the update at that step.

| steps          | Q1  | Q2   | Q3  | Q4  |
| -------------- | --- | ---- | --- | --- |
| A1 \| action 1 | 1   | 0    | 0   | 0   |
| A2 \| action 2 | 1   | 1    | 0   | 0   |
| A3 \| action 2 | 1   | 1.5  | 0   | 0   |
| A4 \| action 2 | 1   | 1.66 | 0   | 0   |
| A5 \| action 3 | 1   | 1.66 | 0   | 0   |

- step A1: action 1 selected. Q(1) = 1
- step A2: action 2 selected. Q(1) = 1, Q(2) = 1
- step A3: action 2 selected. Q(1) = 1, Q(2) = 1.5
- step A4: action 2 selected. Q(1) = 1, Q(2) = 1.66
- step A5: action 3 selected. Q(1) = 1, Q(2) = 1.66, Q(3) = 0

For sure A2 and A5 are ε cases: the system didn't choose the action with the highest Q value.

A1, A3 and A4 could be either greedy or ε cases: at A1 all the estimates are 0 (any action is greedy), and at A3 and A4 the chosen action 2 is (one of) the greedy actions, but a random choice could have picked it as well. So the ε case could possibly have occurred on every time step.

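A short replay of this action-reward sequence with sample-average updates, to double-check the table (plain Python, names are mine; rounding aside, it reproduces the table row by row):

```python
from collections import defaultdict

# Action-reward sequence from the exercise
history = [(1, 1), (2, 1), (2, 2), (2, 2), (3, 0)]

Q = defaultdict(float)   # estimated action values, initialised to 0
N = defaultdict(int)     # visit counts per action

for t, (action, reward) in enumerate(history, start=1):
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]   # incremental sample average
    print(f"step A{t}:", {a: round(Q[a], 2) for a in sorted(Q)})
```
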
#### Incremental formula to estimate action-value

- idea: compute the action values incrementally, so we do not have to store all past rewards and recompute the average every time
- to simplify notation, we concentrate on a single action in the next examples
- $R_{i}$ denotes the reward received after the i-th selection of this action
- $Q_{n}$ denotes the estimate of its action value after it has been selected n-1 times: $$Q_{n}=\frac{R_{1}+R_{2}+\dots+R_{n-1}}{n-1}$$
- given $Q_{n}$ and the reward $R_{n}$, the new average of rewards is $$Q_{n+1}=\frac{1}{n}\sum_{i=1}^nR_{i}$$

General formula: NewEstimate ← OldEstimate + StepSize (Target - OldEstimate), where Target - OldEstimate is the error: $$Q_{n+1} = Q_{n} + \frac{1}{n}\left[R_{n} - Q_{n}\right]$$

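The incremental form follows directly from the definition of the sample average:

$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n}R_{i} = \frac{1}{n}\left(R_{n} + \sum_{i=1}^{n-1}R_{i}\right) = \frac{1}{n}\left(R_{n} + (n-1)Q_{n}\right) = Q_{n} + \frac{1}{n}\left[R_{n} - Q_{n}\right]$$

So we only need to store $Q_{n}$ and the count n, not the full reward history.
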
Pseudocode for the bandit algorithm:

```
Initialize, for a = 1 to k:
    Q(a) = 0
    N(a) = 0

Loop forever:
    with probability 1-eps:
        A = argmax_a(Q(a))
    else:
        A = random action
    R = bandit(A)    # returns the reward of the action A
    N(A) = N(A) + 1
    Q(A) = Q(A) + 1/N(A) * (R - Q(A))
```

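A runnable Python version of this pseudocode. The `run_bandit` helper and the Gaussian stand-in arms are my own choices (the notes do not fix a reward distribution); everything else mirrors the pseudocode above.

```python
import numpy as np

def run_bandit(k: int = 4, epsilon: float = 0.1, steps: int = 1000, seed: int = 0):
    """Sample-average epsilon-greedy agent on a k-armed Gaussian bandit (stand-in)."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)   # hidden q*(a), unknown to the agent

    def bandit(a: int) -> float:
        # Stand-in reward function: Gaussian around the hidden value of arm a.
        return rng.normal(true_values[a], 1.0)

    Q = np.zeros(k)               # estimated action values
    N = np.zeros(k, dtype=int)    # how many times each action was taken
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))       # explore
        else:
            A = int(np.argmax(Q))          # exploit (greedy)
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]          # incremental sample-average update
        total_reward += R

    return Q, true_values, total_reward

Q, q_star, total = run_bandit()
print("estimated Q:", np.round(Q, 2))
print("true q*(a): ", np.round(q_star, 2))
```
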
#### Nonstationary problem

Reward probabilities change over time.

- in the doctor example, a treatment may not be good in all conditions
- the agent (doctor) is unaware of the changes but would like to adapt to them; maybe a treatment is good only in a specific season

An option is to use a fixed step size: we replace the 1/n factor with a constant factor $\alpha$ between 0 and 1, so that recent rewards weigh more than old ones.

The update becomes $Q_{n+1} = Q_{n} + \alpha\left[R_{n} - Q_{n}\right]$, and unrolling it we get $$Q_{n+1} = (1-\alpha)^{n}Q_{1} + \sum_{i=1}^{n}\alpha(1 - \alpha)^{n-i} R_{i}$$ i.e. an exponential recency-weighted average.

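A small sketch comparing the constant step-size update with the sample average on a made-up nonstationary reward stream (the stream and names are assumptions for illustration): the constant-α estimate tracks the shift, the sample average does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Nonstationary reward stream: the mean jumps from 1.0 to 3.0 halfway through.
rewards = np.concatenate([rng.normal(1.0, 0.1, 500), rng.normal(3.0, 0.1, 500)])

alpha = 0.1
q_const, q_avg = 0.0, 0.0
for n, r in enumerate(rewards, start=1):
    q_const += alpha * (r - q_const)   # constant step size: exponential recency weighting
    q_avg += (r - q_avg) / n           # sample average: weighs all rewards equally

print(round(q_const, 2), round(q_avg, 2))  # ~3.0 vs ~2.0 after the shift
```
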
#### Optimistic initial values

Initial action values can be used as a simple way to encourage exploration!

If we initialise the estimates optimistically (higher than any realistic reward), every observed reward is a "disappointment", so even a greedy agent keeps trying other actions until the estimates settle down.

This way we can make the agent explore more at the beginning and explore less after a while, which is cool!
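
A tiny illustration of optimistic initial values under purely greedy selection, reusing the same Gaussian stand-in arms as above (again an assumption, not part of the notes; the optimistic value 5.0 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
k, steps = 4, 500
true_values = rng.normal(0.0, 1.0, size=k)   # hidden arm values (stand-in)

Q = np.full(k, 5.0)          # optimistic start: far above any realistic reward here
N = np.zeros(k, dtype=int)

for _ in range(steps):
    A = int(np.argmax(Q))                    # purely greedy, epsilon = 0
    R = rng.normal(true_values[A], 1.0)      # sampled reward "disappoints" the estimate
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]

print("pulls per arm:", N)   # every arm gets tried early on despite greedy selection
print("greedy pick:", int(np.argmax(Q)), "best arm:", int(np.argmax(true_values)))
```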