vault backup: 2024-10-31 13:26:43

This commit is contained in:
Marco Realacci 2024-10-31 13:26:43 +01:00
parent 4aaef01b22
commit 382617bf06
59 changed files with 490 additions and 148 deletions


@ -32,7 +32,7 @@ RL is learning what to do, it presents two main characteristics:
- take actions that affect the state
Differences from other ML:
- **no supervisor**
- feedback may be delayed
- time matters
- agent action affects future decisions
@ -44,13 +44,6 @@ Learning online
- we expect agents to get things wrong, to refine their understanding as they go
- the world is not static, agents continuously encounter new situations
RL applications:
- self driving cars
- engineering
- healthcare
- news recommendation
- ...
Rewards
- a reward is a scalar feedback signal (a number)
- reward $R_t$ indicates how well the agent is doing at step t
@ -63,12 +56,12 @@ communication in battery free environments
- positive rewards if the queried device has new data
- else negative
#### Challenges:
- tradeoff between exploration and exploitation
- to obtain a lot of reward, an RL agent must prefer actions that it tried in the past and found rewarding
- but better actions may exist... so the agent also has to explore!
##### exploration vs exploitation dilemma:
- comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
- exploitation: we take advantage of the best option we know
- exploration: we test new decisions
@ -108,6 +101,7 @@ one or more of these components
- used to evaluate the goodness/badness of states
- values are predictions of rewards
- $V_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
- better explained later (a small numeric sketch of the discounted sum follows this list)
- **Model:**
- predicts what the environment will do next
- may predict the resultant next state and/or the next reward
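To make the discounted sum inside the value function concrete, here is a minimal numeric sketch (the reward sequence and the value of γ are made up purely for illustration):
```python
# Minimal sketch: discounted sum of a hypothetical reward sequence R_{t+1}, R_{t+2}, ...
gamma = 0.9                      # discount factor (illustrative value)
rewards = [1.0, 0.0, 2.0, 1.0]   # made-up rewards observed from state s onward

discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)         # 1*1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```
The value function is the expectation of this quantity over the trajectories generated by the policy.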
@ -124,103 +118,4 @@ back to the original problem:
- positive when querying a device with new data
- negative if it has no data
- what to do if the device has lost data?
- state?


@ -0,0 +1,122 @@
### K-Armed bandit problem
- Rewards evaluate actions taken
- evaluative feedback depends on the action taken
- no active exploration
Let's consider a simplified version of an RL problem: K-armed bandit problem.
- K different options
- every time we need to choose one
- maximize expected total reward over some time period
- analogy with slot machines
- the levers are the actions
- which lever gives the highest reward?
- **Formalization**
- set of actions A (or "arms")
- a reward function R that follows an unknown probability distribution
- only one state
- at each step t, the agent selects an action A
- the environment generates a reward
- goal: maximize the cumulative reward over time
Example: doctor treatment
- the doctor has 3 treatments (actions), each of them yields a reward
- for the doctor to decide which action is best, we must define the value of taking each action
- we call these values the action values (or the action-value function)
- **action value:** $$q_{*}(a)=\mathbb{E}[R_{t} \mid A_{t}=a]$$
Each action has a reward defined by a probability distribution.
- the red treatment has a Bernoulli probability
- the yellow treatment binomial
- the blue uniform
- the agent does not know the distributions!
![[Pasted image 20241030165705.png]]
- **the estimated action value** $Q_{t}(a)$ for action a is the sum of the rewards observed divided by the number of times the action has been taken: $$Q_{t}(a)=\frac{\sum_{i=1}^{t-1}R_{i}\cdot \mathbb{1}_{A_{i}=a}}{\sum_{i=1}^{t-1}\mathbb{1}_{A_{i}=a}}$$ where $\mathbb{1}_{\text{predicate}}$ is 1 if the predicate is true and 0 otherwise
- **greedy action:**
- doctors assign the treatment they currently think is the best
- greedy action is the action that currently has the largest estimated action value $$A_{t}=\arg\max_{a} Q_{t}(a)$$
- greedy always exploits current knowledge
- **epsilon-greedy:**
- with a probability epsilon sometimes we explore
- 1-eps probability: we choose the best greedy action
- eps probability: we choose a random action (a small selection sketch follows below)
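A minimal sketch of greedy vs. ε-greedy selection over some estimated action values (the Q numbers and ε here are made up for illustration):
```python
import random

# Hypothetical action-value estimates for 4 actions (illustrative numbers).
Q = [1.0, 1.66, 0.0, 0.0]
eps = 0.1

def greedy(Q):
    # Greedy action: argmax over the current estimates.
    return max(range(len(Q)), key=lambda a: Q[a])

def eps_greedy(Q, eps):
    # With probability eps explore (uniform random action),
    # otherwise exploit the current greedy action.
    if random.random() < eps:
        return random.randrange(len(Q))
    return greedy(Q)

print(greedy(Q))           # always 1
print(eps_greedy(Q, eps))  # usually 1, occasionally a random action
```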
#### Exercise 1
In ε-greedy action selection, for the case of two actions and ε=0.5, what is the probability that the greedy action is selected?
*We have two actions. With probability 0.5 we exploit and directly select the greedy action.
But when exploration happens, the greedy action may still be selected!
So: with 0.5 probability we select the greedy action; with 0.5 probability we select a random action, which can be either of the two, so in the random case we select the greedy action with 0.5 * 0.5 = 0.25 probability.
Finally, we select the greedy action with 0.5 + 0.25 = 0.75 probability.*
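A quick Monte Carlo check of the 0.75 result (a small sketch; action 0 is assumed to be the current greedy action):
```python
import random

eps = 0.5
trials = 100_000
greedy_action = 0   # assume action 0 is currently the greedy one
hits = 0

for _ in range(trials):
    if random.random() < eps:
        a = random.choice([0, 1])   # explore: uniform over the 2 actions
    else:
        a = greedy_action           # exploit
    hits += (a == greedy_action)

print(hits / trials)  # approximately 0.75 = (1 - eps) + eps * 0.5
```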
#### Exercise 2
Consider K-armed bandit problem.
K = 4 actions, denoted 1,2,3 and 4
Agent uses eps-greedy action selection
initial Q estimates are 0 for all actions: $$Q_{1}(a)=0$$
The initial sequence of actions and rewards is:
A1 = 1, R1 = 1
A2 = 2, R2 = 1
A3 = 2, R3 = 2
A4 = 2, R4 = 2
A5 = 3, R5 = 0
On some of those time steps, the epsilon case may have occurred, causing an action to be selected at random. On which time steps did this definitely occur?
On which time steps could this possibly have occurred?
***Answer***
To answer, we need to compute the action-value estimates after each step.
In the table, Qa means the Q value of action a after that step.
| steps | Q1 | Q2 | Q3 | Q4 |
| -------------- | --- | ---- | --- | --- |
| A1 \| action 1 | 1 | 0 | 0 | 0 |
| A2 \| action 2 | 1 | 1 | 0 | 0 |
| A3 \| action 2 | 1 | 1.5 | 0 | 0 |
| A4 \| action 2 | 1   | 1.67 | 0   | 0   |
| A5 \| action 3 | 1   | 1.67 | 0   | 0   |
step A1: action 1 selected. Q(1) = 1
step A2: action 2 selected. Q(1) = 1, Q(2) = 1
step A3: action 2 selected. Q(1) = 1, Q(2) = 1.5
step A4: action 2 selected. Q(1) = 1, Q(2) = 1.67
step A5: action 3 selected. Q(1) = 1, Q(2) = 1.67, Q(3) = 0
A2 and A5 are definitely epsilon cases: the selected action was not the one with the highest Q value.
A1, A3 and A4 could have been either greedy or epsilon selections, since a random choice can also happen to pick the greedy action.
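The table can be reproduced by replaying the sequence with incremental sample-average updates (a small sketch, using R2 = 1 to match the Q values above):
```python
# Replay the action/reward sequence and recompute the sample-average Q estimates.
actions = [1, 2, 2, 2, 3]   # A1..A5
rewards = [1, 1, 2, 2, 0]   # R1..R5 (R2 = 1, consistent with the table)
Q = {a: 0.0 for a in range(1, 5)}
N = {a: 0 for a in range(1, 5)}

for step, (a, r) in enumerate(zip(actions, rewards), start=1):
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # incremental sample average
    print(f"after A{step}: " + ", ".join(f"Q({k})={v:.2f}" for k, v in Q.items()))
```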
#### Incremental formula to estimate action-value
- idea: compute the action values incrementally, to avoid recomputing the full average at every step
- to simplify notation, we concentrate on a single action in the next examples
- $R_{i}$ denotes the reward received after the $i$-th selection of this action.
- $Q_{n}$ denotes the estimate of its action value after it has been selected n-1 times $$Q_{n}=\frac{R_{1}+R_{2}+\dots+R_{n-1}}{n-1}$$
- given $Q_{n}$ and the reward $R_{n}$, the new average of all n rewards is $$Q_{n+1}=\frac{1}{n}\sum_{i=1}^{n}R_{i}$$
General formula: NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate), here $$Q_{n+1} = Q_{n} + \frac{1}{n}\left[R_{n} - Q_{n}\right]$$ where (Target - OldEstimate) is the error in the estimate.
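The incremental form follows from the definition of the average (the simplification the formula above refers to):
$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big) = \frac{1}{n}\big(R_n + (n-1)Q_n\big) = Q_n + \frac{1}{n}\big(R_n - Q_n\big)$$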
Pseudocode for bandit algorithm:
```
Initialize for a = 1 to k:
    Q(a) = 0
    N(a) = 0
Loop forever:
    with probability 1-eps:
        A = argmax_a(Q(a))
    else:  # with probability eps
        A = random action
    R = bandit(A)  # returns the reward of the action A
    N(A) = N(A) + 1
    Q(A) = Q(A) + 1/N(A) * (R - Q(A))
```
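A runnable Python sketch of the same algorithm (the Gaussian reward distributions and all parameter values are invented for illustration; `bandit` stands in for the unknown environment):
```python
import random

k = 4
eps = 0.1
steps = 10_000

# Hypothetical true action values: rewards are Gaussian around these means.
true_means = [0.2, 0.8, 0.5, -0.1]

def bandit(a):
    # Sample a reward for action a (stand-in for the unknown environment).
    return random.gauss(true_means[a], 1.0)

Q = [0.0] * k
N = [0] * k

for _ in range(steps):
    if random.random() < eps:
        A = random.randrange(k)                # explore
    else:
        A = max(range(k), key=lambda a: Q[a])  # exploit (greedy action)
    R = bandit(A)
    N[A] += 1
    Q[A] += (R - Q[A]) / N[A]                  # incremental sample average

print([round(q, 2) for q in Q])  # estimates should approach true_means
```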
#### Nonstationary problem:
Reward probabilities change over time.
- in the doctor example, a treatment may not be good in all conditions
- the agent (doctor) is unaware of the changes but would like to adapt to them; maybe a treatment is good only in a specific season
An option is to use a fixed step size: we remove the 1/n factor and use a constant factor $\alpha$ between 0 and 1, so the update becomes $Q_{n+1} = Q_{n} + \alpha\left[R_{n} - Q_{n}\right]$.
Unrolling this recursion gives $$Q_{n+1} = (1-\alpha)^{n}Q_1 + \sum_{i=1}^{n}\alpha(1 - \alpha)^{n-i} R_i$$
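A small sketch of why the constant step size helps when the problem is nonstationary (the shift point, means, and noise level are made up for illustration):
```python
import random

# One arm whose true mean shifts halfway through: the constant step size tracks
# the change, while the sample average lags behind.
alpha = 0.1
q_avg, n = 0.0, 0    # sample-average estimate (1/n step size)
q_const = 0.0        # constant step-size estimate

for t in range(1, 2001):
    true_mean = 1.0 if t <= 1000 else 3.0   # nonstationary reward distribution
    r = random.gauss(true_mean, 0.5)
    n += 1
    q_avg += (r - q_avg) / n
    q_const += alpha * (r - q_const)

print(round(q_avg, 2), round(q_const, 2))   # q_const ends near 3, q_avg near 2
```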
#### Optimistic initial values
Initial action values can be used as a simple way to encourage exploration!
This way we can make the agent explore more at the beginning and explore less after a while (a small sketch follows below).
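For example (a sketch; the initial value +5 is just a guess assumed to be well above any realistic reward), optimistic estimates make every untried arm look attractive, so even a purely greedy agent samples each arm before settling:
```python
k = 4
Q = [5.0] * k   # optimistic initial estimates (5 assumed to exceed any real reward)
N = [0] * k

def update(a, r):
    # Sample-average update: with optimistic initial values every observed reward
    # pulls the chosen arm's estimate down, so a greedy agent keeps switching to
    # the still-optimistic untried arms, which yields early exploration for free.
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

update(0, 1.0)
print(Q)  # [1.0, 5.0, 5.0, 5.0] -> the greedy choice now moves to another arm
```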


@ -15,7 +15,7 @@
![[Pasted image 20241025084755.png]]
... add slide ...
![[Pasted image 20241025084830.png]]
#### Experiments


@ -0,0 +1,132 @@
MDPs are a classical formalization of sequential decision making, where actions influence not just immediate rewards but also subsequent situations (states) and, through those, future rewards.
MDPs involve delayed rewards and the need to trade off immediate and delayed rewards.
Whereas in bandits we estimated the value q*(a) of each action a, in MDPs we estimate the value q*(s,a) of each action a in each state s, or we estimate the value v*(s) of each state s given optimal action selection.
- MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
- The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .
- At each timestep, the agent receives some representation of the environment state and on that basis selects an action
- One time step later, in part as a consequence of its action, the agent receives a numerical reward and finds itself in a new state
- Markov decision processes formally describe an environment for reinforcement learning
- Where the environment is fully observable
- i.e. The current state completely characterises the process
- Almost all RL problems can be formalised as MDPs
- e.g. Bandits are MDPs with one state
Markov property
“The future is independent of the past given the present”
![[Pasted image 20241030102226.png]]![[Pasted image 20241030102243.png]]
- A Markov process (or markov chain) is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.
- S is a finite set of states
- P is a state transition probability matrix (a small sampling sketch follows the example below)
- $P_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_{t}=s]$
Example
![[Pasted image 20241030102420.png]]
![[Pasted image 20241030102722.png]]
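A tiny sketch of a Markov chain as a transition matrix (the states and probabilities here are invented for illustration; each row of P sums to 1):
```python
import random

states = ["Class", "Pub", "Sleep"]   # hypothetical states
P = [                                # P[s][s'] = Pr(S_{t+1} = s' | S_t = s)
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.0, 0.0, 1.0],                 # "Sleep" is an absorbing/terminal state
]

def sample_episode(start=0, max_len=20):
    # Sample a state sequence by repeatedly drawing the next state from P.
    s, episode = start, [states[start]]
    for _ in range(max_len):
        s = random.choices(range(len(states)), weights=P[s])[0]
        episode.append(states[s])
        if states[s] == "Sleep":
            break
    return episode

print(sample_episode())
```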
#### Markov Reward Process
This is a Markov Process but we also have a reward function! We also have a discount factor.
Markov Reward Process is a tuple ⟨S, P, R, γ⟩
- S is a (finite) set of states
- P is a state transition probability matrix, $P_{ss'} = \mathbb{P}[S_{t+1}=s' \mid S_{t}=s]$
- R is a reward function, $R_{s} = \mathbb{E}[R_{t+1} \mid S_{t} = s]$
- γ is a discount factor, $\gamma \in [0, 1]$
![[Pasted image 20241030103041.png]]
![[Pasted image 20241030103114.png]]
- The discount $γ ∈ [0, 1]$ is the present value of future rewards
- The value of receiving reward R after k + 1 time-steps is $γ^kR$
- This values immediate reward above delayed reward
- γ close to 0 leads to ”short-sighted” evaluation
- γ close to 1 leads to ”far-sighted” evaluation
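For reference, the quantity being discounted is the return $G_t$ (standard definition, consistent with the reward sequence and discount above):
$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$$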
Most Markov reward and decision processes are discounted. Why?
- mathematical convenience
- avoids infinite returns in cyclic Markov processes
- uncertainty about the future may not be fully represented
- if the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behaviour shows preference for immediate rewards
- It is sometimes possible to use undiscounted Markov reward processes ($\gamma = 1$), e.g. if all sequences terminate
Value function
- The value function v(s) gives the long-term value of (being in) state s
- The state value function v(s) of an MRP is the expected return starting from state s: $$v(s) = \mathbb{E}[G_{t} \mid S_{t} = s]$$
![[Pasted image 20241030103519.png]]
![[Pasted image 20241030103706.png]]
it is a prediction of the rewards obtainable from the next states
![[Pasted image 20241030103739.png]]
![[Pasted image 20241030103753.png]]
- The value function can be decomposed into two parts:
- immediate reward $R_{t+1}$
- discounted value of the successor state $\gamma v(S_{t+1})$
![[Pasted image 20241030103902.png]]
#### Bellman Equation for MRPs
$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_{t} = s]$
![[Pasted image 20241030104056.png]]
- Bellman equation averages over all the possibilities, weighting each by its probability of occurring
- The value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way
![[Pasted image 20241030104229.png]]
4.3 is the reward I get when exiting the state (-2), plus the discount times the value of each next state, etc.
![[Pasted image 20241030104451.png]]
#### Solving the Bellman Equation
- The Bellman equation is a linear equation
- can be solved directly (a small numpy sketch follows this list) ![[Pasted image 20241030104537.png]]
- direct solution has complexity $O(n^3)$ in the number of states
- many iterative methods
- dynamic programming
- monte-carlo evaluation
- temporal-difference learning
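A small numpy sketch of the direct solution $v = (I - \gamma P)^{-1} R$ in matrix form (the transition matrix and rewards are made-up illustrative values):
```python
import numpy as np

gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],    # hypothetical transition matrix (rows sum to 1)
              [0.2, 0.4, 0.4],
              [0.0, 0.0, 1.0]])
R = np.array([-2.0, -1.0, 0.0])   # hypothetical expected immediate rewards per state

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```
Solving the linear system this way is the O(n^3) direct method; the iterative methods listed above avoid forming and solving the full system at once.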
#### MDP
![[Pasted image 20241030104632.png]]
![[Pasted image 20241030104722.png]]
Before we had only transition probabilities; now we have actions to choose from. But how do we choose?
We use policies. A policy is a distribution over actions given the state: $$\pi(a \mid s)= \mathbb{P}[A_{t}=a \mid S_{t}=s]$$
- policy fully defines the behavior of the agent
- MDP policies depend on the current state (not the history)
- policies are stationary (time-independent, depend only on the state but not on the time)
##### Value function
The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state s, and then following policy $\pi$: $$v_{\pi}(s) = \mathbb{E}_{\pi}[G_{t} \mid S_{t}=s]$$
The action-value function $q_{\pi}(s,a)$ is the expected return starting from state s, taking action a, and then following policy $\pi$: $$q_{\pi}(s,a)= \mathbb{E}_{\pi}[G_{t} \mid S_{t}=s, A_{t}=a]$$
![[Pasted image 20241030105022.png]]
- The state-value function can again be decomposed into immediate reward plus discounted value of the successor state $$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_{t} = s]$$
- The action-value function can similarly be decomposed $$q_{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_{t} = s, A_{t} = a]$$
![[Pasted image 20241030105148.png]]![[Pasted image 20241030105207.png]]
![[Pasted image 20241030105216.png]]
Putting it all together
(very important, remember it!)
![[Pasted image 20241030105234.png]]
![[Pasted image 20241030105248.png]]
as we can see, an action does not necessarily lead to a specific state: the environment may move to different successor states with some probability.
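Written out, the relations in the figures above combine into the Bellman expectation equations (standard form, with $\mathcal{R}^a_s$ the expected immediate reward and $\mathcal{P}^a_{ss'}$ the transition probabilities; included here as a reference since the slides show them only as images):
$$v_{\pi}(s) = \sum_{a} \pi(a \mid s)\, q_{\pi}(s,a)$$
$$q_{\pi}(s,a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s')$$
$$v_{\pi}(s) = \sum_{a} \pi(a \mid s)\Big( \mathcal{R}^{a}_{s} + \gamma \sum_{s'} \mathcal{P}^{a}_{ss'}\, v_{\pi}(s') \Big)$$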
Example: Gridworld
2 actions that lead back to the same state, two actions that go to a different state
![[Pasted image 20241030112555.png]]
![[Pasted image 20241030113041.png]]
![[Pasted image 20241030113135.png]]
C is low because from C I get a reward of 0 everywhere I go.