Case study: battery-free smart home

- each device produces a new data sample at a rate that depends on the environment and the user (continuously, event-based / on demand, ...)
- a device should only transmit when it has new data
- but in backscattering-based networks devices need to be queried by the receiver

In which order should the reader query tags?

- assume predefined (fixed) time slots
- TDMA with random access performs poorly
- TDMA with fixed assignment also performs poorly (wasted queries)
- we want to query devices that have new data samples and avoid (see the sketch below)
    - data loss
    - redundant queries
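
A minimal sketch of the "wasted queries" issue, assuming a fixed round-robin (fixed-assignment TDMA) reader and hypothetical per-slot data-generation probabilities; every number here is illustrative:

```python
import random

# Hypothetical setup: 4 battery-free devices with different per-slot
# probabilities of producing a new sample (illustrative values).
rates = [0.9, 0.5, 0.1, 0.05]
n_slots = 10_000

has_data = [False] * len(rates)
delivered, wasted, lost = 0, 0, 0

for slot in range(n_slots):
    # Each device may produce a sample; overwriting an unread sample = data loss.
    for i, r in enumerate(rates):
        if random.random() < r:
            if has_data[i]:
                lost += 1
            has_data[i] = True

    # Fixed-assignment TDMA: the reader queries devices in a fixed circular order.
    i = slot % len(rates)
    if has_data[i]:
        delivered += 1
        has_data[i] = False
    else:
        wasted += 1  # slot spent querying a device with nothing to send

print(f"delivered={delivered} wasted={wasted} lost={lost}")
```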

Goal: design a MAC protocol that adapts to all of this.

One possibility is to use Reinforcement Learning.

#### Reinforcement learning

How can an intelligent agent learn to make a good sequence of decisions?

- an agent can figure out how the world works by trying things and seeing what happens
- this is what people and animals do
- we explore a computational approach to learning from interaction
- goal-directed learning from interaction

RL is learning what to do; it has two main characteristics:

- trial-and-error search
- delayed reward

- sensation, action, and goal are the three main aspects of a reinforcement learning method
- a learning agent must be able to
    - sense the state of the environment
    - take actions that affect the state

Differences from other ML

- **no supervisor**
- feedback may be delayed
- time matters
- the agent's actions affect future decisions
- a sequence of successful decisions will result in the process being reinforced
- RL learns online

Learning online

- learning while interacting with an ever-changing world
- we expect agents to get things wrong and to refine their understanding as they go
- the world is not static; agents continuously encounter new situations

Rewards

- a reward is a scalar feedback signal (a number)
- the reward $R_t$ indicates how well the agent is doing at step $t$
- the agent should maximize cumulative reward

RL is based on the reward hypothesis:

all goals can be described by the maximization of expected cumulative reward
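
Written out explicitly (standard notation, anticipating the discount factor $\gamma$ that appears in the value function below), the quantity to maximize from step $t$ is the return

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1$$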

Communication in battery-free environments:

- positive reward if the queried device has new data
- negative otherwise
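
A minimal sketch of this reward scheme; the notes only fix the sign, so the +1/-1 magnitudes below are an assumption:

```python
def reward(queried_device_had_new_data: bool) -> float:
    """Per-query reward for the reader: positive when the queried device
    had a fresh sample, negative otherwise (magnitudes are illustrative)."""
    return 1.0 if queried_device_had_new_data else -1.0
```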

#### Challenges:

- tradeoff between exploration and exploitation
    - to obtain a lot of reward, an RL agent must prefer actions that it has tried in the past and found to yield good reward (exploit)
    - but better actions may exist... so the agent also has to explore!

##### exploration vs exploitation dilemma:

- it comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
- exploitation: we take advantage of the best option we know
- exploration: we test new decisions
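
A common, simple way to balance the two is an ε-greedy rule (one possible choice, used here only as an illustration): exploit the best-known action most of the time, explore a uniformly random one with small probability ε.

```python
import random

def epsilon_greedy(q_values: list[float], epsilon: float = 0.1) -> int:
    """Pick an action index given current action-value estimates.

    With probability epsilon: explore (uniformly random action).
    Otherwise: exploit (action with the highest current estimate).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```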

### A general RL framework

**at each timestep the agent:**

- executes action $A_t$
- receives observation $O_t$
- receives scalar reward $R_t$

**the environment:**

- receives action $A_t$
- emits observation $O_t$
- emits scalar reward $R_t$
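
As a sketch, the interaction loop looks like this (the `agent.act`/`agent.learn` and `env.reset`/`env.step` interfaces are hypothetical placeholders, not a specific library):

```python
def run(agent, env, n_steps: int) -> float:
    """Generic agent-environment loop: at each step the agent executes an
    action, the environment answers with an observation and a scalar reward."""
    total_reward = 0.0
    observation = env.reset()                   # initial observation
    for t in range(n_steps):
        action = agent.act(observation)         # agent executes A_t
        observation, reward = env.step(action)  # environment emits O_t, R_t
        agent.learn(observation, reward)        # agent updates from the feedback
        total_reward += reward
    return total_reward
```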

**agent state:** the agent's view of the environment state; it is a function of the history (the sequence of observations, actions, and rewards seen so far)

- the history is involved in taking the next decision:
    - the agent selects actions
    - the environment selects observations/rewards
- the state information is used to determine what happens next
- the state is a function of the history: $S_t = f(H_t)$
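
For instance, the state function can compress the whole history into a few recent quantities; for the tag-query problem one hypothetical choice is the number of slots since each device was last queried:

```python
def state_from_history(history: list[tuple[int, float]], n_devices: int) -> tuple:
    """S_t = f(H_t): summarize the history of (queried device, reward) pairs
    as the number of slots since each device was last queried (0 = never).
    This particular summary is only an illustrative choice."""
    slots_since_query = [0] * n_devices
    for age, (device, _reward) in enumerate(reversed(history), start=1):
        if slots_since_query[device] == 0:
            slots_since_query[device] = age
    return tuple(slots_since_query)
```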

#### Inside the agent

an RL agent may include one or more of these components:

- **Policy:** the agent's behavior function
    - defines what to do (the behavior at a given time)
    - maps states to actions
    - the core of the RL agent
    - the policy is altered based on the reward
    - may be
        - deterministic: a single function of the state
        - stochastic: specifies probabilities for each action
            - the reward changes the probabilities
- **Value function:**
    - specifies what is good in the long run
    - is a prediction of future reward
    - used to evaluate the goodness/badness of states
    - values are predictions of rewards
    - $V_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]$
    - better explained later
- **Model:**
    - predicts what the environment will do next
    - may predict the resulting next state and/or the next reward
    - many problems are model-free
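
A compact sketch of what these three components could look like in code (tabular representations; the state names and numbers are purely illustrative):

```python
import random

# Policy: maps states to actions.
deterministic_policy = {"s0": 1, "s1": 0}                  # state -> action
stochastic_policy = {"s0": [0.8, 0.2], "s1": [0.3, 0.7]}   # state -> action probabilities

def sample_action(state: str) -> int:
    probs = stochastic_policy[state]
    return random.choices(range(len(probs)), weights=probs)[0]

# Value function: prediction of future reward for each state.
state_value = {"s0": 0.7, "s1": -0.1}

# Model: prediction of what the environment does next.
model = {("s0", 1): ("s1", 1.0)}                           # (state, action) -> (next state, reward)
# Many problems (including the query-scheduling one below) are tackled model-free.
```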

back to the original problem:

- n devices
- each device produces new data at rate_i
- in which order should the reader query tags?
- formulate it as an RL problem (see the sketch below):
    - the agent is the reader
    - one action per device (query)
    - rewards:
        - positive when querying a device with new data
        - negative if it has no new data
        - what to do if the device has lost data?
    - state?
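
One possible instantiation of this formulation, as a sketch (the ε-greedy action selection, the incremental value update, and the +1/-1 reward magnitudes are assumptions; lost data and the state definition are left open here, as in the notes):

```python
import random

class ReaderAgent:
    """Illustrative RL reader: one action per device, per-device value
    estimates updated online from the reward, epsilon-greedy querying."""

    def __init__(self, n_devices: int, epsilon: float = 0.1, alpha: float = 0.1):
        self.q = [0.0] * n_devices      # estimated value of querying each device
        self.epsilon = epsilon          # exploration probability
        self.alpha = alpha              # learning rate

    def choose_device(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))                   # explore
        return max(range(len(self.q)), key=lambda i: self.q[i])    # exploit

    def update(self, device: int, reward: float) -> None:
        # Move the estimate for the queried device toward the observed reward.
        self.q[device] += self.alpha * (reward - self.q[device])


# Tiny usage example against the simulated devices from the first sketch.
rates = [0.9, 0.5, 0.1, 0.05]           # hypothetical data-generation probabilities
has_data = [False] * len(rates)
agent = ReaderAgent(n_devices=len(rates))

for slot in range(10_000):
    for i, r in enumerate(rates):
        has_data[i] = has_data[i] or random.random() < r
    device = agent.choose_device()
    reward = 1.0 if has_data[device] else -1.0   # sign as specified above
    has_data[device] = False
    agent.update(device, reward)

print([round(v, 2) for v in agent.q])   # higher-rate devices end up preferred
```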