master-degree-notes/Autonomous Networking/notes/6.1 RL.md

Case study: battery-free smart home

  • each device produces new data samples at a rate that depends on the environment and the user (continuously, event-based, on demand...)
  • a device should only transmit when it has new data
    • but in backscatter-based networks, devices must be queried by the receiver

In which order should the reader query tags?

  • assume pre-defined timeslots
  • TDMA with random access performs poorly
  • TDMA with fixed assignment also performs poorly (wasted queries)
  • we want to query devices that have new data samples and avoid
    • data loss
    • redundant queries

Goal: design a MAC protocol that adapts to all of this. One possibility is to use Reinforcement Learning.

Reinforcement learning

How can an intelligent agent learn to make a good sequence of decisions?

  • an agent can figure out how the world works by trying things and seeing what happens
  • this is what people and animals do
  • we explore a computational approach to learning from interaction
    • goal-directed learning from interaction

RL is learning what to do; it has two main characteristics:

  • trial-and-error search

  • delayed reward

  • sensation, action, and goal are the 3 main aspects of a reinforcement learning method

  • a learning agent must be able to

    • sense the state of the environment
    • take actions that affect the state

Differences from other ML

  • no supervisor
  • feedback may be delayed
  • time matters
  • agent action affects future decisions
  • ...
  • online learning

Learning online

  • learning while interacting with an ever-changing world
  • we expect agents to get things wrong, to refine their understanding as they go
  • the world is not static, agents continuously encounter new situations

RL applications:

  • self driving cars
  • engineering
  • healthcare
  • news recommendation
  • ...

Rewards

  • a reward is a scalar feedback signal (a number)
  • reward R_t indicates how well the agent is doing at step t
  • the agent should maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
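The "expected cumulative reward" is the discounted return. A minimal sketch of how it is computed (the reward values and the discount factor gamma below are made-up illustration numbers, not from the lecture):

```python
# Minimal sketch: cumulative (discounted) reward, the quantity an RL agent maximizes.
# The reward sequence and the discount factor gamma are made-up illustration values.

def discounted_return(rewards, gamma=0.9):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards observed after some time step t
print(discounted_return([1, -1, 1, 1]))  # 1 - 0.9 + 0.81 + 0.729 = 1.639
```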

Communication in battery-free environments:

  • positive reward if the queried device has new data
  • negative reward otherwise

Challenge:

  • tradeoff between exploration and exploitation
  • to obtain a lot of reward, an RL agent must prefer actions that it has tried in the past and found effective in producing reward
  • but to discover such actions, it has to try actions it has not selected before

exploration vs exploitation dilemma:

  • comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
  • exploitation: we take advantage of the best option we know
  • exploration: we test new decisions
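A common recipe to balance the two is ε-greedy action selection: with a small probability ε the agent explores a random action, otherwise it exploits the action with the highest estimated value. A minimal sketch (the number of actions, ε, and the running-average value estimates are illustrative choices, not from the lecture):

```python
import random

# Minimal epsilon-greedy sketch: value estimates are kept as running averages.
# n_actions and epsilon are illustrative values, not from the lecture.
n_actions = 4
epsilon = 0.1
q = [0.0] * n_actions   # estimated value of each action
n = [0] * n_actions     # how many times each action has been tried

def select_action():
    if random.random() < epsilon:
        return random.randrange(n_actions)               # explore: random action
    return max(range(n_actions), key=lambda a: q[a])     # exploit: best estimated action

def update(action, reward):
    n[action] += 1
    q[action] += (reward - q[action]) / n[action]        # incremental average update
```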

A general RL framework

at each timestep the agent

  • executes action
  • receives observation
  • receives scalar reward

the environment receives the action, then emits the next observation and a scalar reward (see the interaction-loop sketch below)

agent state: the agent's view of the environment state; it is a function of the history

  • this function of the history is the information used to take the next decision
  • the state representation defines what happens next
  • ...
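A minimal sketch of this interaction loop; the env/agent interfaces and `n_steps` here are hypothetical placeholders, not an API from the lecture:

```python
# Minimal sketch of the agent-environment loop: at each timestep the agent executes
# an action, then receives an observation and a scalar reward from the environment.
# The env/agent interfaces and n_steps are hypothetical placeholders.

def run_episode(env, agent, n_steps=100):
    history = []                                    # history of (observation, action, reward)
    observation = env.reset()
    for t in range(n_steps):
        action = agent.act(observation)             # agent executes an action
        observation, reward = env.step(action)      # environment emits observation + scalar reward
        agent.observe(observation, reward)          # agent updates its state (a function of the history)
        history.append((observation, action, reward))
    return history
```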

Inside the agent

one or more of these components

  • Policy: agent's behavior function
    • defines what to do (behavior at a given time)
    • maps state to action
    • core of the RL agent
    • the policy is altered based on the reward
    • may be
      • deterministic: single function of the state
      • stochastic: specifies a probability for each action
        • the reward changes the probabilities
  • Value function (see the sketch after this list):
    • specifies what's good in the long run
    • is a prediction of future reward
    • used to evaluate the goodness/badness of states
    • values are predictions of rewards
    • $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
  • Model:
    • predicts what the environment will do next
    • many problems are model free
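A minimal sketch of the policy and value-function components as plain lookup tables, with made-up states, actions, and numbers (none of them come from the lecture):

```python
import random

# Illustrative states, actions, and numbers only; not from the lecture.
states = ["s0", "s1"]
actions = ["query_A", "query_B"]

# Deterministic policy: a single function (here a lookup table) from state to action.
deterministic_policy = {"s0": "query_A", "s1": "query_B"}

# Stochastic policy: a probability for each action in each state;
# the reward signal would be used to shift these probabilities.
stochastic_policy = {
    "s0": {"query_A": 0.8, "query_B": 0.2},
    "s1": {"query_A": 0.3, "query_B": 0.7},
}

def sample_action(state):
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Value function: predicted future (discounted) reward from each state.
value_function = {"s0": 0.6, "s1": -0.1}
```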

back to the original problem:

  • n devices
  • each device i produces new data with rate rate_i
  • in which order should the reader query tags?
  • formulate as an RL problem
    • the agent is the reader
    • one action per device (query)
    • rewards:
      • positive when querying a device with new data
      • negative if it has no data
      • what to do if the device has lost data?
    • state?
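A toy sketch of this formulation, assuming each device i generates a new sample in a slot with probability rate_i; the class name, the ±1 rewards, and the "slots since last query" observation are illustrative assumptions (the "state?" question above is left open in the notes):

```python
import random

# Toy sketch of the reader-vs-tags problem as an RL environment.
# The rates, the +1/-1 rewards, and the observation choice are illustrative assumptions.

class BatteryFreeNetwork:
    def __init__(self, rates):
        self.rates = rates                          # rate_i: prob. device i gets new data per slot
        self.has_data = [False] * len(rates)
        self.slots_since_query = [0] * len(rates)

    def step(self, action):
        """The reader queries device `action`; returns (observation, reward)."""
        reward = 1.0 if self.has_data[action] else -1.0   # new data -> positive, no data -> negative
        self.has_data[action] = False
        for i, rate in enumerate(self.rates):
            self.slots_since_query[i] += 1
            if random.random() < rate:
                # if an unread sample was already pending, it is overwritten (data loss)
                self.has_data[i] = True
        self.slots_since_query[action] = 0
        # observation: one possible answer to the open "state?" question above,
        # namely how many slots ago each device was last queried
        return tuple(self.slots_since_query), reward

env = BatteryFreeNetwork(rates=[0.1, 0.5, 0.9])
obs, reward = env.step(action=2)                    # one action per device: query device 2
```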