Case study: battery-free smart home
- each device produces new data samples at a rate that depends on the environment and the user (continuous, event-based, on demand, ...)
- a device should only transmit when it has new data
- but in backscatter-based networks devices must be queried by the reader before they can transmit
In which order should the reader query tags?
- assume predefined time slots
- TDMA with random access performs poorly
- TDMA with fixed assignment also performs poorly (wasted queries; see the sketch below)
- we want to query devices that have new data samples and avoid
- data loss
- redundant queries
Goal: design a MAC protocol that adapts to all of this. One possibility is to use Reinforcement Learning.
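A minimal sketch of the wasted-query / data-loss problem under fixed-assignment TDMA; the per-device generation rates and the single-sample buffers are assumptions made up for illustration:

```python
import random

# Illustrative only: n devices, each generating a new sample per slot with its
# own probability; a single-sample buffer means an unread sample is lost when
# a new one arrives.
random.seed(0)
rates = [0.9, 0.5, 0.1, 0.05]          # assumed per-slot data-generation probabilities
has_data = [False] * len(rates)
delivered = wasted = lost = 0

for slot in range(10_000):
    for i, p in enumerate(rates):      # devices produce new samples
        if random.random() < p:
            if has_data[i]:
                lost += 1              # previous sample overwritten before being queried
            has_data[i] = True
    q = slot % len(rates)              # fixed-assignment TDMA: round-robin query
    if has_data[q]:
        delivered += 1
        has_data[q] = False
    else:
        wasted += 1                    # slot spent querying a device with nothing new

print(f"delivered={delivered} wasted={wasted} lost={lost}")
```

High-rate devices lose data while waiting for their slot, and low-rate devices waste the slots they get, which is what motivates an adaptive query order.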
Reinforcement learning
How can an intelligent agent learn to make a good sequence of decisions?
- an agent can figure out how the world works by trying things and seeing what happens
- this is what people and animals do
- we explore a computational approach to learning from interaction
- goal-directed learning from interaction
RL is learning what to do; it presents two main characteristics:
- trial-and-error search
- delayed reward
- sensation, action and goal are the 3 main aspects of a reinforcement learning method
- a learning agent must be able to
- sense the state of the environment
- take actions that affect the state
Difference from other ML
- no supervisor
- feedback may be delayed
- time matters
- agent action affects future decisions
- a sequence of successful decisions will result in the process being reinforced
- RL learns online
Learning online
- learning while interacting with an ever changing world
- we expect agents to get things wrong, to refine their understanding as they go
- the world is not static, agents continuously encounter new situations
Rewards
- a reward is a scalar feedback signal (a number)
- reward Rt indicates how well the agent is doing at step t
- the agent should maximize cumulative reward
RL is based on the reward hypothesis: all goals can be described by the maximization of the expected cumulative reward (see the sketch below)
Communication in battery-free environments:
- positive rewards if the queried device has new data
- else negative
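A tiny sketch of the cumulative (discounted) reward the agent tries to maximize; the reward values are illustrative (+1 when the queried device had new data, -1 otherwise) and the discount factor is an assumption:

```python
# Discounted return: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
# Rewards are illustrative: +1 when the queried device had new data, -1 otherwise.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([+1, -1, +1, +1, -1]))   # ~0.98
```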
Challenges:
- tradeoff between exploration and exploitation
- to obtain a lot of reward, an RL agent must prefer actions that it tried in the past and found rewarding (exploit)
- but better actions may exist... so the agent also has to explore!
exploration vs exploitation dilemma:
- comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
- exploitation: we take advantage of the best option we know
- exploration: we test new decisions (see the ε-greedy sketch below)
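One common way to handle the dilemma is ε-greedy action selection; a minimal sketch with invented action-value estimates:

```python
import random

# epsilon-greedy: exploit the best-known action with probability 1-eps,
# explore a uniformly random action with probability eps.
def epsilon_greedy(q_values, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

q_estimates = [0.2, 0.8, 0.5]          # invented action-value estimates
counts = [0, 0, 0]
for _ in range(1000):
    counts[epsilon_greedy(q_estimates)] += 1
print(counts)                          # mostly action 1, with occasional exploration
```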
A general RL framework
at each time step the agent:
- executes action At
- receives observation Ot
- receives scalar reward Rt
the environment:
- receives action At
- emits observation Ot
- emits scalar reward Rt
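A toy version of this loop; ToyEnv and RandomAgent are placeholder classes invented for illustration, not a real API:

```python
import random

# Toy agent-environment loop.
class ToyEnv:
    """At each step exactly one device has fresh data; querying it pays +1, else -1."""
    def __init__(self, n_devices=3):
        self.n = n_devices
        self.ready = random.randrange(self.n)
    def step(self, action):                        # receives A_t
        reward = 1.0 if action == self.ready else -1.0
        observation = (reward > 0)                 # emits O_t: did the query succeed?
        self.ready = random.randrange(self.n)      # the world keeps changing
        return observation, reward                 # emits O_t, R_t

class RandomAgent:
    def act(self, observation, reward):            # receives O_t, R_t; chooses A_t
        return random.randrange(3)

env, agent = ToyEnv(), RandomAgent()
obs, reward, total = False, 0.0, 0.0
for t in range(100):
    action = agent.act(obs, reward)                # agent executes A_t
    obs, reward = env.step(action)                 # environment responds
    total += reward
print(total)
```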
agent state: the agent's view of the environment state; it is a function of the history
- the history is involved in taking the next decision:
- agent selects actions
- environment selects observations/rewards
- the state information is used to determine what happens next
- state is a function of history:
S_t = f(H_t)
Inside the agent
an RL agent may include one or more of these components:
- Policy: agent's behavior function
- defines what to do (behavior at a given time)
- maps state to action
- core of the RL agent
- the policy is altered based on the reward
- may be
- deterministic: single function of the state
- stochastic: specifying probabilities for each action
- the reward changes the probabilities (see the policy sketch after this list)
- Value function:
- specifies what's good in the long run
- is a prediction of future reward
- used to evaluate the goodness/badness of states
- values are predictions of rewards:
V_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]
- better explained later (see the value sketch after this list)
- Model:
- predicts what the environment will do next
- may predict the resultant next state and/or the next reward
- many problems are model free
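As a reference for the Policy bullet above, a minimal sketch of a deterministic versus a stochastic policy over a made-up state/action space:

```python
import random

# Deterministic policy: a single function (here a table) from state to action.
deterministic_policy = {"s0": "query_dev_1", "s1": "query_dev_2"}

# Stochastic policy: per state, a probability for each action; learning would
# shift these probabilities according to the reward.
stochastic_policy = {
    "s0": {"query_dev_1": 0.7, "query_dev_2": 0.3},
    "s1": {"query_dev_1": 0.2, "query_dev_2": 0.8},
}

def sample_action(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(deterministic_policy["s0"], sample_action(stochastic_policy, "s0"))
```

And for the Value function bullet, a tiny Monte Carlo sketch of the idea: the value of a state is estimated as the average discounted return observed after visiting it (the episodes below are invented):

```python
# Every-visit Monte Carlo estimate of V_pi(s): average the discounted return
# observed from each visit to s. An episode is a list of (state, reward) pairs,
# where the reward is the one received on leaving that state.
def mc_value_estimate(episodes, gamma=0.9):
    returns = {}                                   # state -> observed returns
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):    # accumulate the return backwards
            g = reward + gamma * g
            returns.setdefault(state, []).append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

episodes = [
    [("a", 0.0), ("b", 1.0), ("a", -1.0)],
    [("a", 1.0), ("b", 1.0)],
]
print(mc_value_estimate(episodes))
```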
Back to the original problem:
- n devices
- each device produces new data at rate rate_i
- in which order should the reader query tags?
- formulate as an RL problem
- the agent is the reader
- one action per device (query)
- rewards:
- positive when querying a device with new data
- negative if it has no data
- what to do if the device has lost data?
- state?
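Putting the pieces together, a minimal bandit-style sketch of the reader as an ε-greedy agent, with one action per device and ±1 rewards; the device rates, ε, and the incremental-mean update are assumptions for illustration:

```python
import random

random.seed(1)
rates = [0.9, 0.5, 0.1, 0.05]        # assumed per-slot data-generation probabilities
n = len(rates)
has_data = [False] * n

# Q[i]: estimated reward of querying device i, learned from the observed +1/-1 rewards.
Q, N = [0.0] * n, [0] * n
eps = 0.1                            # exploration probability

for slot in range(20_000):
    for i, p in enumerate(rates):                 # devices produce new samples
        if random.random() < p:
            has_data[i] = True
    if random.random() < eps:                     # the agent (reader) picks a device
        a = random.randrange(n)                   # explore
    else:
        a = max(range(n), key=Q.__getitem__)      # exploit
    reward = 1.0 if has_data[a] else -1.0         # +1 new data, -1 wasted query
    has_data[a] = False
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]                # incremental mean of rewards

print([round(q, 2) for q in Q])                   # per-device value estimates
```

Note that this sketch treats every slot as an independent decision (no state), which is exactly what the last bullet leaves open; the time since each device was last queried is one candidate piece of state.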