master-degree-notes/Autonomous Networking/notes/7 RL.md

Case study: battery-free smart home

  • each device produces a new data sample at a rate that depends on the environment and the user (continuous, event-based, on demand...)
  • a device should only transmit when it has new data
    • but in backscattering-based networks they need to be queried by the receiver

In which order should the reader query tags?

  • assume prefixed time slots
  • TDMA with random access performs poorly
  • TDMA with fixed assignment also performs poorly (wasted queries)
  • we want to query devices that have new data samples and avoid
    • data loss
    • redundant queries

Goal: design a MAC protocol that adapts to all of this. One possibility is to use Reinforcement Learning.

Reinforcement learning

How can an intelligent agent learn to make a good sequence of decisions?

  • an agent can figure out how the world works by trying things and seeing what happens
  • this is what people and animals do
  • we explore a computational approach to learning from interaction
    • goal-directed learning from interaction

RL is learning what to do; it has two main characteristics:

  • trial-and-error search

  • delayed reward

Sensation, action and goal are the 3 main aspects of a reinforcement learning method:

  • a learning agent must be able to

    • sense the state of the environment
    • take actions that affect the state

Difference from other ML

  • no supervisor
  • feedback may be delayed
  • time matters
  • the agent's actions affect future decisions
  • a sequence of successful decisions will result in the process being reinforced
  • RL learns online

Learning online

  • learning while interacting with an ever changing world
  • we expect agents to get things wrong, to refine their understanding as they go
  • the world is not static, agents continuously encounter new situations

Rewards

  • a reward is a scalar feedback signal (a number)
  • reward Rt indicates how well the agent is doing at step t
  • the agent should maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
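As a minimal worked form of the hypothesis (assuming a discount factor \gamma \in [0, 1], which is not stated at this point in the notes), maximizing expected cumulative reward means choosing actions that maximize the expected return:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

The value function introduced further down is the expectation of this return from a given state.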

Communication in battery-free environments

  • positive reward if the queried device has new data
  • negative reward otherwise

Challenges:

  • tradeoff between exploration and exploitation
  • to obtain a lot of reward, an RL agent must prefer actions that it tried in the past and found effective (exploitation)
  • but better actions may exist... so the agent also has to explore!

Exploration vs exploitation dilemma (a minimal sketch follows this list):

  • comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
  • exploitation: we take advantage of the best option we know
  • exploration: we test new decisions
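One common way to balance the two (not specific to these notes) is ε-greedy action selection. A minimal sketch in Python, where `q_estimates` and `epsilon` are illustrative names:

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, otherwise exploit.

    q_estimates: list of current value estimates, one per action (illustrative).
    """
    if random.random() < epsilon:
        # exploration: test a uniformly random action
        return random.randrange(len(q_estimates))
    # exploitation: take advantage of the best option we currently know
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])
```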

A general RL framework

at each time step the agent:

  • executes action At
  • receives observation Ot
  • receives scalar reward Rt

the environment:

  • receives action At
  • emits observation Ot
  • emits scalar reward Rt

agent state: the agent's view of the environment state; it is a function of the history (a minimal interaction loop is sketched after this list)

  • the history is involved in taking the next decision:
    • agent selects actions
    • environment selects observations/rewards
  • the state information is used to determine what happens next
    • state is a function of history: S_t = f(H_t)
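A minimal sketch of the interaction loop described above; the `RandomAgent` and `ToyEnvironment` classes are placeholder assumptions, not part of the notes:

```python
import random

class RandomAgent:
    """Placeholder agent: picks a random action and keeps the history."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
    def act(self, observation):
        return random.randrange(self.n_actions)   # executes action A_t
    def update(self, history):
        pass                                      # S_t = f(H_t) would be computed here

class ToyEnvironment:
    """Placeholder environment: rewards action 0, ignores everything else."""
    def reset(self):
        return 0                                  # initial observation
    def step(self, action):
        observation = random.randrange(2)         # emits observation O_t
        reward = 1.0 if action == 0 else -1.0     # emits scalar reward R_t
        return observation, reward

def run_episode(agent, env, num_steps=100):
    history = []                                  # H_t: sequence of (A, O, R) triples
    observation = env.reset()
    for t in range(num_steps):
        action = agent.act(observation)                # agent executes A_t
        observation, reward = env.step(action)         # environment emits O_t, R_t
        history.append((action, observation, reward))
        agent.update(history)                          # agent state from history
    return history

history = run_episode(RandomAgent(n_actions=2), ToyEnvironment())
```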

Inside the agent

one or more of these components

  • Policy: agent's behavior function
    • defines what to do (behavior at a given time)
    • maps state to action
    • core of the RL agent
    • the policy is altered based on the reward
    • may be
      • deterministic: single function of the state
      • stochastic: specifying probabilities for each action
        • reward changes probabilities
  • Value function:
    • specifies what's good in the long run
    • is a prediction of future reward
    • used to evaluate the goodness/badness of states
    • values are predictions of reward
    • V_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]
      • better explained later (a small code sketch of the policy and value function follows this list)
  • Model:
    • predicts what the environment will do next
    • may predict the resultant next state and/or the next reward
    • many problems are model free
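A small sketch of the first two components in code, tabular and dictionary-based; the states, actions and probabilities are illustrative assumptions:

```python
import random

# Stochastic policy: for each state, a probability for each action.
# (A deterministic policy would instead map each state to a single action.)
policy = {
    "s0": {"query_dev_0": 0.5, "query_dev_1": 0.5},
    "s1": {"query_dev_0": 0.9, "query_dev_1": 0.1},
}

# Value function V(s): prediction of future (discounted) reward from state s.
value = {"s0": 0.0, "s1": 0.0}

def sample_action(state):
    """Draw an action according to the policy's probabilities for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action("s0")
```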

Back to the original problem:

  • n devices
  • each device i produces new data with rate rate_i
  • in which order should the reader query tags?
  • formulate it as an RL problem (a toy sketch follows this list)
    • the agent is the reader
    • one action per device (query device i)
    • rewards:
      • positive when querying a device with new data
      • negative if it has no data
      • what to do if the device has lost data?
    • state?
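A toy sketch of this formulation, treating the reader as a simple value-estimating agent with one action per device; the simulated rates, the +1/-1 rewards and the ε-greedy selection are illustrative assumptions, and the sketch ignores both the state question and the lost-data case:

```python
import random

# Simulated tags: device i generates a new sample each slot with probability rates[i]
# (rates are illustrative; in the real system they depend on environment and user).
rates = [0.8, 0.3, 0.1, 0.5]
has_data = [False] * len(rates)

def env_step(device):
    """Generate new data, answer the reader's query, and return the reward."""
    for i, r in enumerate(rates):
        if random.random() < r:
            has_data[i] = True
    reward = 1.0 if has_data[device] else -1.0   # new data -> positive, else negative
    has_data[device] = False                     # queried data is transmitted
    return reward

def choose_device(q, epsilon=0.1):
    """Epsilon-greedy choice of which tag to query next."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda d: q[d])

q = [0.0] * len(rates)                 # one action per device: "query device i"
for step in range(5000):
    device = choose_device(q)
    reward = env_step(device)
    q[device] += 0.1 * (reward - q[device])   # running value estimate per action
```

A fuller formulation would also define a state (for example, the time since each device was last queried), so that the agent can learn when to query each device, not just which devices pay off on average.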