RL is learning what to do; it has two main characteristics:
- take actions that affect the state
Differences from other ML:
- **no supervisor**
- feedback may be delayed
- time matters
- agent actions affect future decisions
Learning online:
- we expect agents to get things wrong, to refine their understanding as they go
- the world is not static, agents continuously encounter new situations
RL applications:
- self driving cars
- engineering
- healthcare
- news recommendation
- ...
Rewards
- a reward is a scalar feedback signal (a number)
- the reward $R_t$ indicates how well the agent is doing at step $t$
Communication in battery-free environments:
- positive rewards if the queried device has new data
- else negative
#### Challenges:
- tradeoff between exploration and exploitation
- to obtain a lot of reward, an RL agent must prefer actions that it has tried in the past
- but better actions may exist... so the agent also has to explore!
##### Exploration vs exploitation dilemma:
- comes from incomplete information: we need to gather enough information to make best overall decisions while keeping the risk under control
- exploitation: we take advantage of the best option we know
- exploration: we try new decisions
An RL agent may include one or more of these components:
- **Value function:**
- used to evaluate the goodness/badness of states
- values are prediction of rewards
- $V_\pi(s) = \mathbb{E}_\pi[\gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid S_t = s]$
- better explained later
- **Model:**
- predicts what the environment will do next
- may predict the resultant next state and/or the next reward
Back to the original problem:
- positive when querying a device with new data
- negative if it has no data
- what to do if the device has lost data?
- state?
### Exploration vs exploitation trade-off
- Rewards evaluate actions taken
- evaluative feedback depends on the action taken
- no active exploration
Let's consider a simplified version of an RL problem: the K-armed bandit problem.
- K different options
- each time we need to choose one
- maximize the expected total reward over some time period
- analogy with slot machines
- the levers are the actions
- which lever gives the highest reward?
- Formalization
- set of actions A (or "arms")
- reward function R that follows an unknown probability distribution
- only one state
- ...
Example: doctor treatment
- the doctor has 3 treatments (actions), each with its own reward
- to decide which action is best, we must define the value of taking each action
- we call these values the action values (or the action-value function)
- action value: ...
Each action has a reward defined by a probability distribution:
- the red treatment has a Bernoulli distribution
- the yellow treatment a binomial distribution
- the blue one a uniform distribution
- the agent does not know the distributions!
- the estimated value of action a is the sum of the rewards observed when taking a, divided by the number of times a has been taken: $Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$
- $\mathbb{1}_{\text{predicate}}$ denotes the indicator random variable (1 if the predicate is true, else 0)
- greedy action:
- the doctor assigns the treatment they currently think is the best
- ...
- the greedy action is computed as the argmax of the Q values
- greedy always exploits current knowledge
- epsilon-greedy: with a small probability epsilon we explore instead of exploiting (see the sketch after this list)
- with probability 1-eps we choose the best (greedy) action
- with probability eps we choose a random action
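A minimal sketch of this setup in Python (not from the original notes): the three treatments with assumed parameters, Bernoulli(0.4), Binomial(5, 0.3), Uniform(0, 2), sample-average action-value estimates, and ε-greedy selection.
```python
import numpy as np

rng = np.random.default_rng(0)

# Reward distributions of the three treatments. The distribution families come from the
# example above; the parameters (0.4, (5, 0.3), (0, 2)) are illustrative assumptions.
arms = [
    lambda: rng.binomial(1, 0.4),    # red: Bernoulli
    lambda: rng.binomial(5, 0.3),    # yellow: binomial
    lambda: rng.uniform(0.0, 2.0),   # blue: uniform
]

k = len(arms)
sums = np.zeros(k)     # sum of rewards observed per action
counts = np.zeros(k)   # times each action has been taken
eps = 0.1

for t in range(1000):
    Q = sums / np.maximum(counts, 1)  # sample-average estimates (0 for untried actions)
    if rng.random() < eps:
        a = int(rng.integers(k))      # explore: random action
    else:
        a = int(np.argmax(Q))         # exploit: greedy action (argmax of Q)
    r = arms[a]()                     # reward drawn from the unknown distribution
    sums[a] += r
    counts[a] += 1

print("estimated action values:", np.round(sums / np.maximum(counts, 1), 2))
print("selection counts:", counts)
```
With a fixed eps = 0.1 the agent keeps exploring forever: wasteful once the estimates are accurate, but it guarantees every arm keeps being sampled.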
exercises ...
exercise 2: k-armed bandit problem.
K = 4 actions, denoted 1,2,3 and 4
eps-greedy selection
initial Q estimates = 0 for all a.
The initial sequence of actions and rewards is:
A1 = 1 R1 = 1
A2 = 2 R2 = 1
A3 = 2 R3 = 2
A4 = 2 R4 = 2
A5 = 3 R5 = 0
---
step A1: action 1 selected. Q(1) = 1
step A2: action 2 selected. Q(1) = 1, Q(2) = 1
step A3: action 2 selected. Q(1) = 1, Q(2) = 1.5
step A4: action 2 selected. Q(1) = 1, Q(2) ≈ 1.67
step A5: action 3 selected. Q(1) = 1, Q(2) ≈ 1.67, Q(3) = 0
A2 and A5 are certainly epsilon cases: the system did not choose the action with the highest Q value.
A3 and A4 could be either greedy or epsilon cases.
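As a quick sanity check (not part of the exercise), a short script that recomputes the sample-average estimates for the sequence above and flags the steps that can only come from the epsilon branch:
```python
import numpy as np

# Action/reward sequence from the exercise above (actions are 1-indexed as in the notes).
episode = [(1, 1), (2, 1), (2, 2), (2, 2), (3, 0)]

k = 4
sums, counts = np.zeros(k), np.zeros(k)

for t, (a, r) in enumerate(episode, start=1):
    Q = sums / np.maximum(counts, 1)           # sample-average estimates before this step
    greedy = np.flatnonzero(Q == Q.max()) + 1  # greedy actions (ties kept), 1-indexed
    must_explore = a not in greedy             # a non-greedy choice can only come from the eps branch
    sums[a - 1] += r
    counts[a - 1] += 1
    Q = sums / np.maximum(counts, 1)
    print(f"A{t}={a}  Q={np.round(Q, 2)}  definitely exploratory: {must_explore}")
```
Only A2 and A5 are flagged, matching the conclusion above.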
#### Incremental formula to estimate action-value
- to simplify notation we concentrate on a single action
- $R_i$ denotes the reward received after the $i$-th selection of this action; $Q_n$ denotes the estimate of its action value after it has been selected $n-1$ times: $Q_n = \frac{R_1 + R_2 + \dots + R_{n-1}}{n-1}$
- given $Q_n$ and the reward $R_n$, the new average of rewards can be computed incrementally: $Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = Q_n + \frac{1}{n}\left[R_n - Q_n\right]$ (checked numerically after this list)
- in general: NewEstimate <- OldEstimate + StepSize [Target - OldEstimate]
- [Target - OldEstimate] is the error in the estimate
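A tiny numerical check of the incremental formula, using an arbitrary simulated reward sequence:
```python
import numpy as np

rng = np.random.default_rng(1)
rewards = rng.normal(size=100)   # arbitrary reward sequence for a single action

Q = 0.0
for n, r in enumerate(rewards, start=1):
    Q += (r - Q) / n             # Q_{n+1} = Q_n + (1/n)[R_n - Q_n]

assert np.isclose(Q, rewards.mean())  # incremental estimate equals the batch sample average
```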
Pseudocode for bandit algorithm:
```
Initialize, for a = 1 to k:
    Q(a) = 0
    N(a) = 0
Loop forever:
    with probability 1-eps:
        A = argmax_a Q(a)
    else:
        A = random action
    R = bandit(A)                      # returns the reward for action A
    N(A) = N(A) + 1
    Q(A) = Q(A) + 1/N(A) * (R - Q(A))
```
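A runnable Python translation of the pseudocode. The 10-armed Gaussian testbed (arm means drawn from a normal distribution, unit-variance rewards) is an illustrative assumption, not something specified in the notes:
```python
import numpy as np

def run_bandit(k=10, eps=0.1, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    true_means = rng.normal(size=k)         # unknown true action values (Gaussian testbed assumption)

    Q = np.zeros(k)                         # action-value estimates
    N = np.zeros(k)                         # selection counts
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < eps:
            A = int(rng.integers(k))        # explore: random action
        else:
            A = int(np.argmax(Q))           # exploit: greedy action
        R = rng.normal(true_means[A], 1.0)  # bandit(A): sample a reward for action A
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]           # incremental sample-average update
        total_reward += R

    return Q, N, total_reward

Q, N, total = run_bandit()
print("best estimated action:", int(np.argmax(Q)), " average reward:", round(total / 1000, 3))
```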
Nonstationary problem: the reward probabilities change over time.
- in the doctor example, a treatment may not be good in all conditions
- the agent (doctor) is unaware of the changes but would like to adapt to them
One option is to use a fixed step size: we replace the $\frac{1}{n}$ factor with a constant $\alpha \in (0, 1]$, so the update becomes $Q_{n+1} = Q_n + \alpha[R_n - Q_n]$.
Unrolling this recursion gives $Q_{n+1} = (1-\alpha)^{n}Q_1 + \sum_{i=1}^{n}{\alpha(1 - \alpha)^{n-i} R_i}$, a weighted average that gives more weight to recent rewards.
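For reference, the unrolling that leads to this expression (standard algebra, reconstructed here):
```latex
\begin{aligned}
Q_{n+1} &= Q_n + \alpha\,[R_n - Q_n] \\
        &= \alpha R_n + (1-\alpha) Q_n \\
        &= \alpha R_n + (1-\alpha)\bigl[\alpha R_{n-1} + (1-\alpha) Q_{n-1}\bigr] \\
        &= \alpha R_n + (1-\alpha)\,\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1} \\
        &\;\;\vdots \\
        &= (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i
\end{aligned}
```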
... ADD MISSING PART ...