Case study: battery-free smart home

- each device produces a new data sample at a rate that depends on the environment and on the user (continuously, event-based, on demand, ...)
- a device should only transmit when it has new data
- but in backscattering-based networks devices need to be queried by the receiver

In which order should the reader query the tags?

- assume predefined time slots
- TDMA with random access performs poorly
- TDMA with fixed assignment also performs poorly (wasted queries)
- we want to query devices that have new data samples and avoid
  - data loss
  - redundant queries

Goal: design a MAC protocol that adapts to all of this. One possibility is to use Reinforcement Learning.

#### Reinforcement learning

How can an intelligent agent learn to make a good sequence of decisions?

- an agent can figure out how the world works by trying things and seeing what happens
- it is what people and animals do
- we explore a computational approach to learning from interaction
- goal-directed learning from interaction

RL is learning what to do; it presents two main characteristics:

- trial-and-error search
- delayed reward

Sensation, action and goal are the three main aspects of a reinforcement learning method. A learning agent must be able to

- sense the state of the environment
- take actions that affect that state

Differences from other ML:

- **no supervisor**
- feedback may be delayed
- time matters
- agent actions affect future decisions
- a sequence of successful decisions results in the process being reinforced
- RL learns online

Learning online

- learning while interacting with an ever-changing world
- we expect agents to get things wrong and to refine their understanding as they go
- the world is not static: agents continuously encounter new situations

Rewards

- a reward is a scalar feedback signal (a number)
- the reward $R_t$ indicates how well the agent is doing at step $t$
- the agent should maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.

Communication in battery-free environments:

- positive reward if the queried device has new data
- negative reward otherwise

#### Challenges:

- tradeoff between exploration and exploitation
- to obtain a lot of reward, an RL agent must prefer actions that it tried in the past and found rewarding (exploit)
- but better actions may exist... so the agent also has to explore! (see the ε-greedy sketch below)
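The tension between the two can be made concrete with ε-greedy action selection. The sketch below is a minimal Python illustration, assuming a hypothetical list `Q` of estimated action values; the names `epsilon_greedy`, `Q` and `epsilon` are illustrative, not part of the case study.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Choose an action index given value estimates Q (a list of floats).

    With probability epsilon the agent explores (random action);
    otherwise it exploits (the action with the highest estimated value).
    """
    if random.random() < epsilon:
        return random.randrange(len(Q))               # explore: try any action
    return max(range(len(Q)), key=lambda a: Q[a])     # exploit: best known action
```

A small ε keeps the risk of bad choices under control while still gathering information about actions that have rarely been tried.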
##### Exploration vs exploitation dilemma

- comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
- exploitation: we take advantage of the best option we know
- exploration: we test new decisions

### A general RL framework

**At each timestep the agent:**

- executes action $A_t$
- receives observation $O_t$
- receives scalar reward $R_t$

**The environment:**

- receives action $A_t$
- emits observation $O_t$
- emits scalar reward $R_t$

**Agent state:** the agent's view of the environment state; it is a function of the history

- the history is involved in taking the next decision:
  - the agent selects actions
  - the environment selects observations/rewards
- the state information is used to determine what happens next
- the state is a function of the history: $S_t = f(H_t)$

#### Inside the agent

One or more of these components:

- **Policy:** the agent's behavior function
  - defines what to do (behavior at a given time)
  - maps states to actions
  - core of the RL agent
  - the policy is altered based on the reward
  - may be
    - deterministic: a single function of the state
    - stochastic: specifies probabilities for each action
      - the reward changes the probabilities
- **Value function:**
  - specifies what is good in the long run
  - is a prediction of future reward
  - used to evaluate the goodness/badness of states
  - values are predictions of rewards
  - $V_\pi(s) = \mathbb{E}_\pi\left[\gamma R_{t+1} + \gamma^2 R_{t+2} + \dots \mid S_t = s\right]$
  - better explained later
- **Model:**
  - predicts what the environment will do next
  - may predict the resulting next state and/or the next reward
  - many problems are model free

Back to the original problem:

- $n$ devices
- each device $i$ produces new data with rate $rate_i$
- in which order should the reader query the tags?
- formulate it as an RL problem (a minimal sketch follows this list)
  - the agent is the reader
  - one action per device (query)
  - rewards:
    - positive when querying a device with new data
    - negative if it has no data
  - what to do if the device has lost data?
  - state?
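To make the formulation concrete, here is a minimal, stateless (bandit-style) sketch of the reader in Python. The per-device probabilities `RATES`, the constants `ALPHA` and `EPSILON`, and the +1/-1 reward values are hypothetical choices used only for illustration; the ε-greedy rule is the same one sketched earlier, inlined to keep the example self-contained.

```python
import random

N_DEVICES = 5
RATES = [0.8, 0.5, 0.3, 0.1, 0.05]   # hypothetical per-slot probabilities of new data
ALPHA, EPSILON = 0.1, 0.1            # illustrative step size and exploration rate

Q = [0.0] * N_DEVICES                # one value estimate per action (device to query)
has_data = [False] * N_DEVICES       # simulated "new sample available" flag per device

for t in range(10_000):
    # environment: each device may produce a new sample in this time slot
    for i in range(N_DEVICES):
        has_data[i] = has_data[i] or random.random() < RATES[i]

    # agent: epsilon-greedy choice of which device to query
    if random.random() < EPSILON:
        a = random.randrange(N_DEVICES)
    else:
        a = max(range(N_DEVICES), key=lambda i: Q[i])

    # reward: positive if the queried device had new data, negative otherwise
    r = 1.0 if has_data[a] else -1.0
    has_data[a] = False              # a successful query drains the device buffer

    # constant-step-size update, so estimates can track changing rates
    Q[a] += ALPHA * (r - Q[a])

print([round(q, 2) for q in Q])      # higher estimates for devices with higher rates
```

A constant step size is used instead of a sample average so the estimates keep tracking devices whose data rates change over time, matching the online-learning setting above. How to penalize lost data (e.g., a larger negative reward when a device's buffer overflows) and what state the agent should keep remain the open questions listed above.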