vault backup: 2024-10-31 13:26:43

This commit is contained in:
Marco Realacci 2024-10-31 13:26:43 +01:00
parent 4aaef01b22
commit 382617bf06
59 changed files with 490 additions and 148 deletions


@ -32,7 +32,7 @@ RL is learning what to do, it presents two main characteristics:
- take actions that affect the state
Differences from other ML:
- **no supervisor**
- feedback may be delayed
- time matters
- agent action affects future decisions
@ -44,13 +44,6 @@ Learning online
- we expect agents to get things wrong, to refine their understanding as they go
- the world is not static, agents continuously encounter new situations
RL applications:
- self driving cars
- engineering
- healthcare
- news recommendation
- ...
Rewards
- a reward is a scalar feedback signal (a number)
- reward $R_t$ indicates how well the agent is doing at step t
@ -63,12 +56,12 @@ communication in battery free environments
- positive rewards if the queried device has new data
- else negative
#### Challenges:
- tradeoff between exploration and exploitation
- to obtain a lot of reward, an RL agent must prefer actions that it tried in the past and found rewarding
- but better actions may exist... so the agent also has to explore!
##### exploration vs exploitation dilemma:
- comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
- exploitation: we take advantage of the best option we know
- exploration: we test new decisions
@ -108,6 +101,7 @@ one or more of these components
- used to evaluate the goodness/badness of states
- values are predictions of rewards
- $V_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
- better explained later (a small numeric sketch of the discounted sum follows this list)
- **Model:**
- predicts what the environment will do next
- may predict the resultant next state and/or the next reward
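To make the discounted sum inside the value function concrete, here is a minimal numeric sketch (the reward sequence and the value of γ are made up purely for illustration):
```python
# Minimal sketch: discounted sum of a hypothetical reward sequence R_{t+1}, R_{t+2}, ...
gamma = 0.9                      # discount factor (illustrative value)
rewards = [1.0, 0.0, 2.0, 1.0]   # made-up rewards observed from state s onward

discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)         # 1*1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```
The value function is the expectation of this quantity over the trajectories generated by the policy.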
@ -124,103 +118,4 @@ back to the original problem:
- positive when querying a device with new data
- negative if it has no data
- what to do if the device has lost data?
- state?


@ -0,0 +1,122 @@
### K-Armed bandit problem
- Rewards evaluate actions taken
- evaluative feedback depends on the action taken
- no active exploration
Let's consider a simplified version of an RL problem: K-armed bandit problem.
- K different options
- every time we need to choose one
- maximize expected total reward over some time period
- analogy with slot machines
- the levers are the actions
- which lever gives the highest reward?
- **Formalization**
- set of actions A (or "arms")
- a reward function R that follows an unknown probability distribution
- only one state
- at each step t, the agent selects an action A
- the environment generates a reward
- goal: maximize the cumulative reward over time
Example: doctor treatment
- the doctor has 3 treatments (actions), each of them yields a reward
- for the doctor to decide which action is best, we must define the value of taking each action
- we call these values the action values (or the action-value function)
- **action value:** $$q_{*}(a)=\mathbb{E}[R_{t} \mid A_{t}=a]$$
Each action has a reward defined by a probability distribution.
- the red treatment has a Bernoulli probability
- the yellow treatment binomial
- the blue uniform
- the agent does not know the distributions!
![[Pasted image 20241030165705.png]]
- **the estimated action value** $Q_{t}(a)$ for action a is the sum of the rewards observed divided by the number of times the action has been taken: $$Q_{t}(a)=\frac{\sum_{i=1}^{t-1}R_{i}\cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1}\mathbb{1}_{A_{i}=a}}$$ where $\mathbb{1}_{\text{predicate}}$ is 1 if the predicate is true and 0 otherwise
- **greedy action:**
- doctors assign the treatment they currently think is the best
- greedy action is the action that currently has the largest estimated action value $$A_{t}=\arg\max_{a} Q_{t}(a)$$
- greedy always exploits current knowledge
- **epsilon-greedy:**
- with a probability epsilon sometimes we explore
- 1-eps probability: we choose the best greedy action
- eps probability: we choose a random action (a small selection sketch follows below)
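A minimal sketch of greedy vs. ε-greedy selection over some estimated action values (the Q numbers and ε here are made up for illustration):
```python
import random

# Hypothetical action-value estimates for 4 actions (illustrative numbers).
Q = [1.0, 1.66, 0.0, 0.0]
eps = 0.1

def greedy(Q):
    # Greedy action: argmax over the current estimates.
    return max(range(len(Q)), key=lambda a: Q[a])

def eps_greedy(Q, eps):
    # With probability eps explore (uniform random action),
    # otherwise exploit the current greedy action.
    if random.random() < eps:
        return random.randrange(len(Q))
    return greedy(Q)

print(greedy(Q))           # always 1
print(eps_greedy(Q, eps))  # usually 1, occasionally a random action
```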
#### Exercise 1
In ε-greedy action selection, for the case of two actions and ε=0.5, what is the probability that the greedy action is selected?
*We have two actions. With probability 0.5 we exploit and directly select the greedy action.
But when exploration happens, the greedy action may still be selected!
So: with 0.5 probability we select the greedy action; with 0.5 probability we select a random action, which can be either of the two, so in the random case we select the greedy action with 0.5 * 0.5 = 0.25 probability.
Finally, we select the greedy action with 0.5 + 0.25 = 0.75 probability.*
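A quick Monte Carlo check of the 0.75 result (a small sketch; action 0 is assumed to be the current greedy action):
```python
import random

eps = 0.5
trials = 100_000
greedy_action = 0   # assume action 0 is currently the greedy one
hits = 0

for _ in range(trials):
    if random.random() < eps:
        a = random.choice([0, 1])   # explore: uniform over the 2 actions
    else:
        a = greedy_action           # exploit
    hits += (a == greedy_action)

print(hits / trials)  # approximately 0.75 = (1 - eps) + eps * 0.5
```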
#### Exercise 2
Consider K-armed bandit problem.
K = 4 actions, denoted 1,2,3 and 4
Agent uses eps-greedy action selection
initial Q estimates are 0 for all actions: $$Q_{1}(a)=0$$
The initial sequence of actions and rewards is:
A1 = 1, R1 = 1
A2 = 2, R2 = 1
A3 = 2, R3 = 2
A4 = 2, R4 = 2
A5 = 3, R5 = 0
On some of those time steps, the epsilon case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur?
On which time steps could this possibly have occurred?
***Answer***
To answer, we need to compute the action-value estimates after each step.
In the table, Qa means the Q value of action a after that step.
| steps | Q1 | Q2 | Q3 | Q4 |
| -------------- | --- | ---- | --- | --- |
| A1 \| action 1 | 1 | 0 | 0 | 0 |
| A2 \| action 2 | 1 | 1 | 0 | 0 |
| A3 \| action 2 | 1 | 1.5 | 0 | 0 |
| A4 \| action 2 | 1   | 1.67 | 0   | 0   |
| A5 \| action 3 | 1   | 1.67 | 0   | 0   |
step A1: action 1 selected. Q(1) = 1
step A2: action 2 selected. Q(1) = 1, Q(2) = 1
step A3: action 2 selected. Q(1) = 1, Q(2) = 1.5
step A4: action 2 selected. Q(1) = 1, Q(2) = 1.67
step A5: action 3 selected. Q(1) = 1, Q(2) = 1.67, Q(3) = 0
A2 and A5 are definitely epsilon cases: the selected action was not the one with the highest Q value.
A1, A3 and A4 could have been either greedy or epsilon selections, since a random choice can also happen to pick the greedy action.
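The table can be reproduced by replaying the sequence with incremental sample-average updates (a small sketch, using R2 = 1 to match the Q values above):
```python
# Replay the action/reward sequence and recompute the sample-average Q estimates.
actions = [1, 2, 2, 2, 3]   # A1..A5
rewards = [1, 1, 2, 2, 0]   # R1..R5 (R2 = 1, consistent with the table)
Q = {a: 0.0 for a in range(1, 5)}
N = {a: 0 for a in range(1, 5)}

for step, (a, r) in enumerate(zip(actions, rewards), start=1):
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # incremental sample average
    print(f"after A{step}: " + ", ".join(f"Q({k})={v:.2f}" for k, v in Q.items()))
```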
#### Incremental formula to estimate action-value
- idea: compute the action values incrementally, to avoid recomputing the full average at every step
- to simplify notation, we concentrate on a single action in the next examples
- $R_{i}$ denotes the reward received after the $i$-th selection of this action.
- $Q_{n}$ denotes the estimate of its action value after it has been selected n-1 times $$Q_{n}=\frac{R_{1}+R_{2}+\dots+R_{n-1}}{n-1}$$
- given $Q_{n}$ and the reward $R_{n}$, the new average of all n rewards is $$Q_{n+1}=\frac{1}{n}\sum_{i=1}^{n}R_{i}$$
General formula: NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate), here $$Q_{n+1} = Q_{n} + \frac{1}{n}\left[R_{n} - Q_{n}\right]$$ where (Target - OldEstimate) is the error in the estimate.
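The incremental form follows from the definition of the average (the simplification the formula above refers to):
$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big) = \frac{1}{n}\big(R_n + (n-1)Q_n\big) = Q_n + \frac{1}{n}\big(R_n - Q_n\big)$$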
Pseudocode for bandit algorithm:
```
Initialize for a = 1 to k:
    Q(a) = 0
    N(a) = 0
Loop forever:
    with probability 1-eps:
        A = argmax_a(Q(a))
    else:  # with probability eps
        A = random action
    R = bandit(A)  # returns the reward of the action A
    N(A) = N(A) + 1
    Q(A) = Q(A) + 1/N(A) * (R - Q(A))
```
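A runnable Python sketch of the same algorithm (the Gaussian reward distributions and all parameter values are invented for illustration; `bandit` stands in for the unknown environment):
```python
import random

k = 4
eps = 0.1
steps = 10_000

# Hypothetical true action values: rewards are Gaussian around these means.
true_means = [0.2, 0.8, 0.5, -0.1]

def bandit(a):
    # Sample a reward for action a (stand-in for the unknown environment).
    return random.gauss(true_means[a], 1.0)

Q = [0.0] * k
N = [0] * k

for _ in range(steps):
    if random.random() < eps:
        A = random.randrange(k)                # explore
    else:
        A = max(range(k), key=lambda a: Q[a])  # exploit (greedy action)
    R = bandit(A)
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]                  # incremental sample average

print([round(q, 2) for q in Q])  # estimates should approach true_means
```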
#### Nonstationary problem:
Reward probabilities change over time.
- in the doctor example, a treatment may not be good in all conditions
- the agent (doctor) is unaware of the changes but would like to adapt to them; maybe a treatment is good only in a specific season
An option is to use a fixed step size: we remove the 1/n factor and use a constant factor $\alpha$ between 0 and 1, so the update becomes $Q_{n+1} = Q_{n} + \alpha\left[R_{n} - Q_{n}\right]$.
Unrolling this recursion gives $$Q_{n+1} = (1-\alpha)^{n}Q_1 + \sum_{i=1}^{n}\alpha(1 - \alpha)^{n-i} R_i$$
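A small sketch of why the constant step size helps when the problem is nonstationary (the shift point, means, and noise level are made up for illustration):
```python
import random

# One arm whose true mean shifts halfway through: the constant step size tracks
# the change, while the sample average lags behind.
alpha = 0.1
q_avg, n = 0.0, 0    # sample-average estimate (1/n step size)
q_const = 0.0        # constant step-size estimate

for t in range(1, 2001):
    true_mean = 1.0 if t <= 1000 else 3.0   # nonstationary reward distribution
    r = random.gauss(true_mean, 0.5)
    n += 1
    q_avg += (r - q_avg) / n
    q_const += alpha * (r - q_const)

print(round(q_avg, 2), round(q_const, 2))   # q_const ends near 3, q_avg near 2
```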
#### Optimistic initial values
Initial action values can be used as a simple way to encourage exploration!
This way we can make the agent explore more at the beginning and explore less after a while (a small sketch follows below).
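For example (a sketch; the initial value +5 is just a guess assumed to be well above any realistic reward), optimistic estimates make every untried arm look attractive, so even a purely greedy agent samples each arm before settling:
```python
k = 4
Q = [5.0] * k   # optimistic initial estimates (5 assumed to exceed any real reward)
N = [0] * k

def update(a, r):
    # Sample-average update: with optimistic initial values every observed reward
    # pulls the chosen arm's estimate down, so a greedy agent keeps switching to
    # the still-optimistic untried arms, which yields early exploration for free.
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

update(0, 1.0)
print(Q)  # [1.0, 5.0, 5.0, 5.0] -> the greedy choice now moves to another arm
```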


@ -15,7 +15,7 @@
![[Pasted image 20241025084755.png]]
... add slide ...
![[Pasted image 20241025084830.png]]
#### Experiments


@ -0,0 +1,132 @@
MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states) and, through those, future rewards.
MDPs involve delayed rewards and the need to trade off immediate and delayed rewards.
Whereas in bandits we estimated the value q*(a) of each action a, in MDPs we estimate the value q*(s,a) of each action a in each state s, or we estimate the value v*(s) of each state s given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .
- At each timestep, the agent receives some representation of the environment state and on that basis selects an action
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state
- Markov decision processes formally describe an environment for reinforcement learning
- Where the environment is fully observable
- i.e. The current state completely characterises the process
- Almost all RL problems can be formalised as MDPs
- e.g. Bandits are MDPs with one state
Markov property
“The future is independent of the past given the present”
![[Pasted image 20241030102226.png]]![[Pasted image 20241030102243.png]]
- A Markov process (or markov chain) is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.
- S is a finite set of states
- P is a state transition probability matrix (a small sampling sketch follows the example below)
- $P_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_{t}=s]$
Example
![[Pasted image 20241030102420.png]]
![[Pasted image 20241030102722.png]]
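A tiny sketch of a Markov chain as a transition matrix (the states and probabilities here are invented for illustration; each row of P sums to 1):
```python
import random

states = ["Class", "Pub", "Sleep"]   # hypothetical states
P = [                                # P[s][s'] = Pr(S_{t+1} = s' | S_t = s)
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.0, 0.0, 1.0],                 # "Sleep" is an absorbing/terminal state
]

def sample_episode(start=0, max_len=20):
    # Sample a state sequence by repeatedly drawing the next state from P.
    s, episode = start, [states[start]]
    for _ in range(max_len):
        s = random.choices(range(len(states)), weights=P[s])[0]
        episode.append(states[s])
        if states[s] == "Sleep":
            break
    return episode

print(sample_episode())
```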
#### Markov Reward Process
This is a Markov Process but we also have a reward function! We also have a discount factor.
Markov Reward Process is a tuple ⟨S, P, R, γ⟩
- S is a (finite) set of states
- P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_{t}=s]$
- R is a reward function, $R_{s} = \mathbb{E}[R_{t+1} \mid S_{t} = s]$
- γ is a discount factor, $\gamma \in [0, 1]$
![[Pasted image 20241030103041.png]]
![[Pasted image 20241030103114.png]]
- The discount $γ ∈ [0, 1]$ is the present value of future rewards
- The value of receiving reward R after k + 1 time-steps is $γ^kR$
- This values immediate reward above delayed reward
- γ close to 0 leads to ”short-sighted” evaluation
- γ close to 1 leads to ”far-sighted” evaluation
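For reference, the quantity being discounted is the return $G_t$ (standard definition, consistent with the reward sequence and discount above):
$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$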
Most Markov reward and decision processes are discounted. Why?
- mathematical convenience
- avoids infinite returns in cyclic Markov processes
- uncertainty about the future may not be fully represented
- if the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate rewards
- It is sometimes possible to use undiscounted Markov reward processes ($\gamma = 1$), e.g. if all sequences terminate
Value function
- The value function v(s) gives the long-term value of (being in) state s
- The state value function v(s) of an MRP is the expected return starting from state s: $$v(s) = \mathbb{E}[G_{t} \mid S_{t} = s]$$
![[Pasted image 20241030103519.png]]
![[Pasted image 20241030103706.png]]
it is a prediction of the rewards obtainable from the next states
![[Pasted image 20241030103739.png]]
![[Pasted image 20241030103753.png]]
- The value function can be decomposed into two parts:
- immediate reward $R_{t+1}$
- discounted value of the successor state $\gamma v(S_{t+1})$
![[Pasted image 20241030103902.png]]
#### Bellman Equation for MRPs
$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_{t} = s]$
![[Pasted image 20241030104056.png]]
- Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way
![[Pasted image 20241030104229.png]]
4.3 is the reward I get when exiting the state (-2), plus the discount times the value of each next state, etc.
![[Pasted image 20241030104451.png]]
#### Solving the Bellman Equation
- The Bellman equation is a linear equation
- can be solved directly (a small numpy sketch follows this list) ![[Pasted image 20241030104537.png]]
- direct solution has complexity $O(n^3)$ in the number of states
- many iterative methods
- dynamic programming
- monte-carlo evaluation
- temporal-difference learning
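A small numpy sketch of the direct solution $v = (I - \gamma P)^{-1} R$ in matrix form (the transition matrix and rewards are made-up illustrative values):
```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],    # hypothetical transition matrix (rows sum to 1)
              [0.2, 0.4, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -1.0, 0.0])   # hypothetical expected immediate rewards per state

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```
Solving the linear system this way is the O(n^3) direct method; the iterative methods listed above avoid forming and solving the full system at once.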
#### MDP
![[Pasted image 20241030104632.png]]
![[Pasted image 20241030104722.png]]
Before we had only transition probabilities; now we have actions to choose from. But how do we choose?
We use policies. A policy is a distribution over actions given the state: $$\pi(a \mid s)= \mathbb{P}[A_{t}=a \mid S_{t}=s]$$
- policy fully defines the behavior of the agent
- MDP policies depend on the current state (not the history)
- policies are stationary (time-independent, depend only on the state but not on the time)
##### Value function
The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state s, and then following policy $\pi$: $$v_{\pi}(s) = \mathbb{E}_{\pi}[G_{t} \mid S_{t}=s]$$
The action-value function $q_{\pi}(s,a)$ is the expected return starting from state s, taking action a, and then following policy $\pi$: $$q_{\pi}(s,a)= \mathbb{E}_{\pi}[G_{t} \mid S_{t}=s, A_{t}=a]$$
![[Pasted image 20241030105022.png]]
- The state-value function can again be decomposed into immediate reward plus discounted value of the successor state $$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_{t} = s]$$
- The action-value function can similarly be decomposed $$q_{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_{t} = s, A_{t} = a]$$
![[Pasted image 20241030105148.png]]![[Pasted image 20241030105207.png]]
![[Pasted image 20241030105216.png]]
Putting it all together
(very important, remember it!)
![[Pasted image 20241030105234.png]]
![[Pasted image 20241030105248.png]]
as we can see, an action does not necessarily lead to a specific state: the environment may move to different successor states with some probability.
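Written out, the relations in the figures above combine into the Bellman expectation equations (standard form, with $\mathcal{R}^a_s$ the expected immediate reward and $\mathcal{P}^a_{ss'}$ the transition probabilities; included here as a reference since the slides show them only as images):
$$v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s,a)$$
$$q_{\pi}(s,a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s')$$
$$v_{\pi}(s) = \sum_{a} \pi(a \mid s)\Big( \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s') \Big)$$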
Example: Gridworld
2 actions that lead back to the same state, two actions that go to a different state
![[Pasted image 20241030112555.png]]
![[Pasted image 20241030113041.png]]
![[Pasted image 20241030113135.png]]
C is low because from C I get a reward of 0 everywhere I go.