Case study: battery-free smart home

  • each device produces a new data sample at a rate that depends on the environment and the user (continuous, event-based, on-demand, ...)
  • a device should only transmit when it has new data
    • but in backscattering-based networks devices must be queried by the receiver before they can transmit

In which order should the reader query tags?

  • assume predefined time slots
  • TDMA with random access performs poorly
  • TDMA with fixed assignment also performs poorly (wasted queries)
  • we want to query devices that have new data samples and avoid
    • data loss
    • redundant queries

Goal: design a MAC protocol that adapts to all of this. One possibility is to use Reinforcement Learning.

Reinforcement learning

How can an intelligent agent learn to make a good sequence of decisions?

  • an agent can figure out how the world works by trying things and seeing what happens
  • this is what people and animals do
  • we explore a computational approach to learning from interaction
    • goal-directed learning from interaction

RL is learning what to do. It has two main characteristics:

  • trial-and-error search

  • delayed reward

  • sensation, action and goal are the 3 main aspects of a reinforcement learning method

  • a learning agent must be able to

    • sense the state of the environment
    • take actions that affect the state

Differences from other ML

  • no supervisor
  • feedback may be delayed
  • time matters
  • agent action affects future decisions
  • a sequence of successful decisions will result in the process being reinforced
  • RL learns online

Learning online

  • learning while interacting with an ever-changing world
  • we expect agents to get things wrong and to refine their understanding as they go
  • the world is not static: agents continuously encounter new situations

RL applications:

  • self driving cars
  • engineering
  • healthcare
  • news recommendation
  • ...

Rewards

  • a reward is a scalar feedback signal (a number)
  • reward Rt indicates how well the agent is doing at step t
  • the agent should maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
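
In the standard notation (Sutton & Barto), the cumulative reward from step $t$ onwards is called the return; with a discount factor $\gamma \in [0, 1]$ it is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

and the agent's goal is to maximize its expected value.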

Communication in battery-free environments

  • positive reward if the queried device has new data
  • negative reward otherwise

Challenge:

  • trade-off between exploration and exploitation
  • to obtain a lot of reward, an RL agent must prefer actions that it has tried in the past and found to be effective (exploitation)
  • but better actions may exist, so the agent also has to explore!

exploration vs exploitation dilemma:

  • it comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
  • exploitation: we take advantage of the best option we know
  • exploration: we test new decisions to gather more information

A general RL framework

at each time step the agent:

  • executes action At
  • receives observation Ot
  • receives scalar reward Rt

the environment:

  • receives action At
  • emits observation Ot
  • emits scalar reward Rt
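
A minimal sketch of this loop in Python; the `agent` and `env` objects and their methods are hypothetical, used only to make the structure concrete:

```python
# Hypothetical agent/environment interfaces: at every step the agent acts,
# then receives the next observation and a scalar reward from the environment.
def run(agent, env, steps=100):
    observation = env.reset()                    # initial observation O_0
    total_reward = 0.0
    for t in range(steps):
        action = agent.act(observation)          # agent executes A_t
        observation, reward = env.step(action)   # env emits O_{t+1} and R_{t+1}
        agent.observe(observation, reward)       # agent updates its internal state
        total_reward += reward
    return total_reward
```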

agent state: the agent's view of the environment state; it is a function of the history

  • this function of the history is used to take the next decision
  • the state representation defines what happens next
  • ...

Inside the agent

one or more of these components

  • Policy: agent's behavior function
    • defines what to do (behavior at a given time)
    • maps state to action
    • core of the RL agent
    • the policy is altered based on the reward
    • may be
      • deterministic: single function of the state
      • stochastic: specifying a probability for each action
        • the reward changes these probabilities
  • Value function:
    • specifies what's good in the long run
    • is a prediction of future reward
    • used to evaluate the goodness/badness of states
    • values are prediction of rewards
    • $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
  • Model:
    • predicts what the environment will do next
    • many problems are tackled model-free (without a model)
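
A toy illustration of the deterministic/stochastic policy distinction (the states and actions below are invented for the example):

```python
import random

# Deterministic policy: a plain mapping state -> action.
deterministic_policy = {"has_new_data": "query", "no_new_data": "skip"}

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {
    "has_new_data": {"query": 0.9, "skip": 0.1},
    "no_new_data":  {"query": 0.2, "skip": 0.8},
}

def sample_action(policy, state):
    """Draw an action according to a stochastic policy."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```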

Back to the original problem:

  • n devices
  • each device i produces new data with rate $r_i$
  • in which order should the reader query the tags?
  • formulate it as an RL problem (see the sketch below)
    • the agent is the reader
    • one action per device (query that device)
    • rewards:
      • positive when querying a device with new data
      • negative if it has no data
      • what to do if the device has lost data?
    • state?
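
A sketch of this formulation; the `Device` model and the reward values (+1/-1) are illustrative assumptions, not the protocol defined in the lecture:

```python
import random

class Device:
    """Battery-free device that generates a new sample with probability `rate` per slot."""
    def __init__(self, rate):
        self.rate = rate
        self.has_data = False

    def tick(self):
        # In every time slot a new sample may be produced.
        if random.random() < self.rate:
            self.has_data = True

def query_reward(device):
    """Reward for the reader (the agent): +1 for a useful query, -1 for a wasted one."""
    if device.has_data:
        device.has_data = False   # the query reads the sample out
        return +1
    return -1
```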

Exploration vs exploitation trade-off

  • Rewards evaluate actions taken
  • evaluative feedback depends on the action taken
  • this creates the need for active exploration (we must try actions to evaluate them)

Let's consider a simplified version of an RL problem: the K-armed bandit problem.

  • K different options (actions)
  • at every step we need to choose one of them
  • goal: maximize the expected total reward over some time period
  • analogy with slot machines
    • the levers are the actions
    • which lever gives the highest reward?
  • Formalization (a minimal sketch in code follows this list)
    • a set of actions A (or "arms")
    • a reward function R that follows an unknown probability distribution
    • only one state
    • ...
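
A minimal k-armed bandit environment; Gaussian rewards per arm are my assumption for illustration, the formalization above only says the distribution is unknown:

```python
import random

class KArmedBandit:
    """k arms, one state; each arm has a fixed but unknown expected reward."""
    def __init__(self, k, seed=None):
        self.rng = random.Random(seed)
        self.true_means = [self.rng.gauss(0.0, 1.0) for _ in range(k)]

    def pull(self, arm):
        # The agent only observes a noisy reward, never the true mean.
        return self.rng.gauss(self.true_means[arm], 1.0)
```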

Example: doctor treatment

  • the doctor has 3 treatments (actions), each of them yields a reward
  • for the doctor to decide which action is best, we must define the value of taking each action
  • we call these values the action values (or the action-value function)
  • action value: the expected reward when action a is taken, $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$

Each action has a reward defined by a probability distribution.

  • the red treatment has a Bernoulli reward distribution

  • the yellow treatment a binomial one

  • the blue treatment a uniform one

  • the agent does not know these distributions!

  • the estimated value of action a is the sum of rewards observed when a was taken, divided by the number of times a has been taken: $Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot \mathbb{1}_{A_i = a}}{\sum_{i=1}^{t-1} \mathbb{1}_{A_i = a}}$

    • $\mathbb{1}_{\text{predicate}}$ denotes the indicator random variable (1 if the predicate is true, 0 otherwise)
  • greedy action:

    • the doctor assigns the treatment they currently think is best
    • ...
    • the greedy action is computed as the argmax of the Q values: $A_t = \arg\max_a Q_t(a)$
    • greedy always exploits current knowledge
  • epsilon-greedy:

    • with probability epsilon we sometimes explore instead of exploiting (see the sketch after this list)
      • with probability 1-eps: we choose the best (greedy) action
      • with probability eps: we choose a random action
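
A small sketch of the sample-average estimate and of eps-greedy selection (the helper names are mine; `history` is a list of (action, reward) pairs):

```python
import random

def estimate_q(history, action):
    """Sample-average estimate: mean of the rewards observed when `action` was taken."""
    rewards = [r for a, r in history if a == action]
    return sum(rewards) / len(rewards) if rewards else 0.0

def epsilon_greedy(history, actions, eps=0.1):
    """With probability eps pick a random action, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)                  # explore
    q_values = {a: estimate_q(history, a) for a in actions}
    return max(q_values, key=q_values.get)             # exploit: argmax of Q
```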

exercises ...

Exercise 2: k-armed bandit problem with K = 4 actions, denoted 1, 2, 3 and 4; eps-greedy action selection; initial Q estimates Q(a) = 0 for all a.

The initial sequence of actions and rewards is: A1 = 1, R1 = 1; A2 = 2, R2 = 1; A3 = 2, R3 = 2; A4 = 2, R4 = 2; A5 = 3, R5 = 0


step A1: action 1 selected. Q(1) = 1
step A2: action 2 selected. Q(1) = 1, Q(2) = 1
step A3: action 2 selected. Q(1) = 1, Q(2) = 1.5
step A4: action 2 selected. Q(1) = 1, Q(2) ≈ 1.67
step A5: action 3 selected. Q(1) = 1, Q(2) ≈ 1.67, Q(3) = 0

A2 and A5 are certainly exploration (epsilon) cases: the system did not choose the action with the highest Q value. A3 and A4 could each be either a greedy or an exploration case.
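
A quick check of these estimates (the sample averages are recomputed after every step):

```python
# (action, reward) pairs from the exercise above.
steps = [(1, 1), (2, 1), (2, 2), (2, 2), (3, 0)]
totals = {a: 0.0 for a in (1, 2, 3, 4)}
counts = {a: 0 for a in (1, 2, 3, 4)}
for t, (action, reward) in enumerate(steps, start=1):
    totals[action] += reward
    counts[action] += 1
    q = {a: (totals[a] / counts[a] if counts[a] else 0.0) for a in (1, 2, 3, 4)}
    print(f"after step {t}: {q}")
```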

Incremental formula to estimate action-value

  • to simplify notation we concentrate on a single action
  • $R_i$ denotes the reward received after the i-th selection of this action; $Q_n$ denotes the estimate of its action value after it has been selected n-1 times: $Q_n = \frac{R_1 + R_2 + \dots + R_{n-1}}{n-1}$
  • given $Q_n$ and the reward $R_n$, the new average of rewards can be computed incrementally (derivation below): $Q_{n+1} = Q_n + \frac{1}{n}\big[R_n - Q_n\big]$
    • NewEstimate <- OldEstimate + StepSize (Target - OldEstimate)
    • (Target - OldEstimate) is the error in the estimate
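
The simplification behind the incremental update is the standard one:

$$Q_{n+1} = \frac{1}{n}\sum_{i=1}^{n} R_i = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big) = \frac{1}{n}\big(R_n + (n-1)\,Q_n\big) = Q_n + \frac{1}{n}\big[R_n - Q_n\big]$$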

Pseudocode for bandit algorithm:

Initialize for a = 1 to k:
	Q(a) = 0
	N(a) = 0
Loop forever:
	with probability 1-eps:
		A = argmax_a(Q(a))
	else:
		A = random action
	R = bandit(A) # returns the reward of the action A
	N(A) = N(A) + 1
	Q(A) = Q(A) + 1/N(A) * (R - Q(A))
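
A runnable Python version of this pseudocode; it reuses the `KArmedBandit` sketch from above, so the Gaussian reward model and the parameter values are still assumptions:

```python
import random

def run_bandit(bandit, k, eps=0.1, steps=1000, seed=0):
    """eps-greedy bandit algorithm with incremental sample-average updates."""
    rng = random.Random(seed)
    Q = [0.0] * k                                      # action-value estimates
    N = [0] * k                                        # selection counts
    for _ in range(steps):
        if rng.random() < eps:
            A = rng.randrange(k)                       # explore: random action
        else:
            A = max(range(k), key=lambda a: Q[a])      # exploit: greedy action
        R = bandit.pull(A)                             # reward from the environment
        N[A] += 1
        Q[A] += (R - Q[A]) / N[A]                      # incremental update
    return Q, N

# Example: Q, N = run_bandit(KArmedBandit(k=4, seed=1), k=4)
```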

Nonstationary problem: reward probabilities change over time.

  • in the doctor example, a treatment may not be good in all conditions
  • the agent (the doctor) is unaware of the changes, but would like to adapt to them

One option is to use a fixed step size: we replace the $\frac{1}{n}$ factor with a constant $\alpha \in (0, 1]$, giving the update $Q_{n+1} = Q_n + \alpha\big[R_n - Q_n\big]$. Unrolling this recursion shows that it is an exponential recency-weighted average of past rewards: $Q_{n+1} = (1-\alpha)^{n}Q_1 + \sum_{i=1}^{n}{\alpha(1 - \alpha)^{n-i} R_i}$
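
A small sketch contrasting the sample-average and constant step-size updates on a single action whose true mean drifts (the drift model and alpha = 0.1 are made-up illustration values):

```python
import random

rng = random.Random(0)
true_mean = 0.0
q_avg, n = 0.0, 0            # sample-average estimate (step size 1/n)
q_alpha, alpha = 0.0, 0.1    # constant step-size estimate

for _ in range(2000):
    true_mean += rng.gauss(0.0, 0.05)      # nonstationary: the mean drifts every step
    reward = rng.gauss(true_mean, 1.0)
    n += 1
    q_avg += (reward - q_avg) / n          # weights all past rewards equally
    q_alpha += alpha * (reward - q_alpha)  # weights recent rewards more heavily

print(f"true mean {true_mean:.2f}, sample average {q_avg:.2f}, constant alpha {q_alpha:.2f}")
```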

... ADD MISSING PART ...