### K-Armed bandit problem
- Rewards evaluate the actions taken
- evaluative feedback depends on the action taken
- no active exploration

Let's consider a simplified version of an RL problem: the K-armed bandit problem.

- K different options
- every time we need to choose one
- maximize the expected total reward over some time period
- analogy with slot machines
    - the levers are the actions
    - which lever gives the highest reward?
- **Formalization**
    - a set of actions A (or "arms")
    - a reward function R that follows an unknown probability distribution
    - only one state
    - at each step t, the agent selects an action $A_{t}$
    - the environment generates a reward $R_{t}$
    - goal: maximize the cumulative reward

Example: doctor treatment

- a doctor has 3 treatments (actions), each of them yielding a reward
- for the doctor to decide which action is best, we must define the value of taking each action
- we call these values the action values (or the action-value function)
- **action value:** $$q_{*}(a)=E[R_{t} \mid A_{t}=a]$$

Each action has a reward defined by a probability distribution:

- the red treatment follows a Bernoulli distribution
- the yellow treatment a binomial distribution
- the blue treatment a uniform distribution
- the agent does not know these distributions!

![[Pasted image 20241030165705.png]]
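
To make the "unknown distributions" concrete, here is a minimal sketch of such a bandit environment in Python. The notes only say red ~ Bernoulli, yellow ~ binomial, blue ~ uniform; the parameters, the `ARMS` table and the `bandit` helper below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reward distributions for the three treatments
# (parameters are made up for illustration).
ARMS = {
    "red":    lambda: rng.binomial(n=1, p=0.4),        # Bernoulli(0.4)
    "yellow": lambda: rng.binomial(n=5, p=0.3),        # Binomial(5, 0.3)
    "blue":   lambda: rng.uniform(low=0.0, high=2.0),  # Uniform(0, 2)
}

def bandit(action: str) -> float:
    """Pull one arm and return a sampled reward; the agent never sees ARMS."""
    return float(ARMS[action]())

# Example pulls: the agent only observes these samples, not the distributions.
for a in ARMS:
    print(a, [round(bandit(a), 2) for _ in range(5)])
```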

- **the estimated action value Q** of action a is the sum of rewards observed when taking a, divided by the number of times a has been taken (a sample average): $$Q_{t}(a)=\frac{\text{sum of rewards when } a \text{ was taken prior to } t}{\text{number of times } a \text{ was taken prior to } t}$$
- **greedy action:**
    - doctors assign the treatment they currently think is the best
    - the greedy action is the action that currently has the largest estimated action value: $$A_{t}=\operatorname{argmax}_{a}\,Q_{t}(a)$$
    - greedy always exploits current knowledge
- **epsilon-greedy** (see the sketch after this list):
    - with probability epsilon we sometimes explore
    - with probability 1-eps: we choose the best (greedy) action
    - with probability eps: we choose a random action

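A minimal sketch of ε-greedy action selection over the estimated values, assuming the estimates are kept in a NumPy array `Q` (the `epsilon_greedy` function name and the array layout are my own, not from the notes):

```python
import numpy as np

def epsilon_greedy(Q: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform random action
    return int(np.argmax(Q))               # exploit: current greedy action

# usage: two actions with estimates Q, epsilon = 0.1
rng = np.random.default_rng(42)
Q = np.array([1.0, 1.5])
print(epsilon_greedy(Q, 0.1, rng))
```

Note that `np.argmax` breaks ties by taking the first maximal entry; a uniform tie-break is also common.
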
#### Exercise 1

In ε-greedy action selection, for the case of two actions and ε = 0.5, what is the probability that the greedy action is selected?

*Answer:* we have two actions. With probability 0.5 we exploit and select the greedy action directly.

But when exploration happens, the greedy action may still be selected!

So, with probability 0.5 we select the greedy action. With probability 0.5 we select a random action, which can be either of the two, so in the random case we select the greedy action with probability 0.5 * 0.5 = 0.25.

Finally, we select the greedy action with probability 0.5 + 0.25 = 0.75.

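A quick Monte Carlo check of this result (the variable names are mine; the snippet just simulates the selection rule many times and confirms 0.75 = (1 - ε) + ε / 2):

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, n_actions, greedy_action = 0.5, 2, 0

trials = 100_000
hits = 0
for _ in range(trials):
    if rng.random() < epsilon:
        action = rng.integers(n_actions)   # explore: uniform over both actions
    else:
        action = greedy_action             # exploit
    hits += int(action == greedy_action)

print(hits / trials)  # ≈ 0.75
```
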
#### Exercise 2

Consider a K-armed bandit problem:

- K = 4 actions, denoted 1, 2, 3 and 4
- the agent uses ε-greedy action selection
- the initial Q estimates are 0 for all actions: $$Q_{1}(a)=0$$
- the initial sequence of actions and rewards is:
    - A1 = 1, R1 = 1
    - A2 = 2, R2 = 1
    - A3 = 2, R3 = 2
    - A4 = 2, R4 = 2
    - A5 = 3, R5 = 0

On some of those time steps, the ε case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur?

On which time steps could this possibly have occurred?

***Answer***

To answer, we need to compute the sequence of action-value estimates.

In the table, $Q_{a}$ denotes the estimated value of action a after the update at that step.

| steps          | Q1  | Q2   | Q3  | Q4  |
| -------------- | --- | ---- | --- | --- |
| A1 \| action 1 | 1   | 0    | 0   | 0   |
| A2 \| action 2 | 1   | 1    | 0   | 0   |
| A3 \| action 2 | 1   | 1.5  | 0   | 0   |
| A4 \| action 2 | 1   | 1.66 | 0   | 0   |
| A5 \| action 3 | 1   | 1.66 | 0   | 0   |

- step A1: action 1 selected. Q(1) = 1
- step A2: action 2 selected. Q(1) = 1, Q(2) = 1
- step A3: action 2 selected. Q(1) = 1, Q(2) = 1.5
- step A4: action 2 selected. Q(1) = 1, Q(2) = 1.66
- step A5: action 3 selected. Q(1) = 1, Q(2) = 1.66, Q(3) = 0

For sure A2 and A5 are ε cases: the system didn't choose the action with the highest Q value.

A1, A3 and A4 could be either greedy or ε cases: at A1 all the estimates are 0 (any action is greedy), and at A3 and A4 the chosen action 2 is (one of) the greedy actions, but a random choice could have picked it as well. So the ε case could possibly have occurred on every time step.

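A short replay of this action-reward sequence with sample-average updates, to double-check the table (plain Python, names are mine; rounding aside, it reproduces the table row by row):

```python
from collections import defaultdict

# Action-reward sequence from the exercise
history = [(1, 1), (2, 1), (2, 2), (2, 2), (3, 0)]

Q = defaultdict(float)   # estimated action values, initialised to 0
N = defaultdict(int)     # visit counts per action

for t, (action, reward) in enumerate(history, start=1):
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]   # incremental sample average
    print(f"step A{t}:", {a: round(Q[a], 2) for a in sorted(Q)})
```
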
#### Incremental formula to estimate action-value

- idea: compute the action values incrementally, so we do not have to store all past rewards and recompute the average every time
- to simplify notation, we concentrate on a single action in the next examples
- $R_{i}$ denotes the reward received after the i-th selection of this action
- $Q_{n}$ denotes the estimate of its action value after it has been selected n-1 times: $$Q_{n}=\frac{R_{1}+R_{2}+\dots+R_{n-1}}{n-1}$$
- given $Q_{n}$ and the reward $R_{n}$, the new average of rewards is $$Q_{n+1}=\frac{1}{n}\sum_{i=1}^nR_{i}$$

General formula: NewEstimate ← OldEstimate + StepSize (Target - OldEstimate), where Target - OldEstimate is the error: $$Q_{n+1} = Q_{n} + \frac{1}{n}\left[R_{n} - Q_{n}\right]$$

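The incremental form follows directly from the definition of the sample average:

$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n}R_{i} = \frac{1}{n}\left(R_{n} + \sum_{i=1}^{n-1}R_{i}\right) = \frac{1}{n}\left(R_{n} + (n-1)Q_{n}\right) = Q_{n} + \frac{1}{n}\left[R_{n} - Q_{n}\right]$$

So we only need to store $Q_{n}$ and the count n, not the full reward history.
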
Pseudocode for the bandit algorithm:

```
Initialize, for a = 1 to k:
    Q(a) = 0
    N(a) = 0

Loop forever:
    with probability 1-eps:
        A = argmax_a(Q(a))
    else:
        A = random action
    R = bandit(A)    # returns the reward of the action A
    N(A) = N(A) + 1
    Q(A) = Q(A) + 1/N(A) * (R - Q(A))
```

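A runnable Python version of this pseudocode. The `run_bandit` helper and the Gaussian stand-in arms are my own choices (the notes do not fix a reward distribution); everything else mirrors the pseudocode above.

```python
import numpy as np

def run_bandit(k: int = 4, epsilon: float = 0.1, steps: int = 1000, seed: int = 0):
    """Sample-average epsilon-greedy agent on a k-armed Gaussian bandit (stand-in)."""
    rng = np.random.default_rng(seed)
    true_values = rng.normal(0.0, 1.0, size=k)   # hidden q*(a), unknown to the agent

    def bandit(a: int) -> float:
        # Stand-in reward function: Gaussian around the hidden value of arm a.
        return rng.normal(true_values[a], 1.0)

    Q = np.zeros(k)               # estimated action values
    N = np.zeros(k, dtype=int)    # how many times each action was taken
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            A = int(rng.integers(k))       # explore
        else:
            A = int(np.argmax(Q))          # exploit (greedy)
        R = bandit(A)
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]          # incremental sample-average update
        total_reward += R

    return Q, true_values, total_reward

Q, q_star, total = run_bandit()
print("estimated Q:", np.round(Q, 2))
print("true q*(a): ", np.round(q_star, 2))
```
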
#### Nonstationary problem

Reward probabilities change over time.

- in the doctor example, a treatment may not be good in all conditions
- the agent (doctor) is unaware of the changes but would like to adapt to them; maybe a treatment is good only in a specific season

An option is to use a fixed step size: we replace the 1/n factor with a constant factor $\alpha$ between 0 and 1, so that recent rewards weigh more than old ones.

The update becomes $Q_{n+1} = Q_{n} + \alpha\left[R_{n} - Q_{n}\right]$, and unrolling it we get $$Q_{n+1} = (1-\alpha)^{n}Q_{1} + \sum_{i=1}^{n}\alpha(1 - \alpha)^{n-i} R_{i}$$ i.e. an exponential recency-weighted average.

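A small sketch comparing the constant step-size update with the sample average on a made-up nonstationary reward stream (the stream and names are assumptions for illustration): the constant-α estimate tracks the shift, the sample average does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Nonstationary reward stream: the mean jumps from 1.0 to 3.0 halfway through.
rewards = np.concatenate([rng.normal(1.0, 0.1, 500), rng.normal(3.0, 0.1, 500)])

alpha = 0.1
q_const, q_avg = 0.0, 0.0
for n, r in enumerate(rewards, start=1):
    q_const += alpha * (r - q_const)   # constant step size: exponential recency weighting
    q_avg += (r - q_avg) / n           # sample average: weighs all rewards equally

print(round(q_const, 2), round(q_avg, 2))  # ~3.0 vs ~2.0 after the shift
```
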
#### Optimistic initial values

Initial action values can be used as a simple way to encourage exploration!

If we initialise the estimates optimistically (higher than any realistic reward), every observed reward is a "disappointment", so even a greedy agent keeps trying other actions until the estimates settle down.

This way we can make the agent explore more at the beginning and explore less after a while, which is cool!
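
A tiny illustration of optimistic initial values under purely greedy selection, reusing the same Gaussian stand-in arms as above (again an assumption, not part of the notes; the optimistic value 5.0 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
k, steps = 4, 500
true_values = rng.normal(0.0, 1.0, size=k)   # hidden arm values (stand-in)

Q = np.full(k, 5.0)          # optimistic start: far above any realistic reward here
N = np.zeros(k, dtype=int)

for _ in range(steps):
    A = int(np.argmax(Q))                    # purely greedy, epsilon = 0
    R = rng.normal(true_values[A], 1.0)      # sampled reward "disappoints" the estimate
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]

print("pulls per arm:", N)   # every arm gets tried early on despite greedy selection
print("greedy pick:", int(np.argmax(Q)), "best arm:", int(np.argmax(true_values)))
```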