master-degree-notes/Autonomous Networking/notes/7 RL.md

Case study: battery-free smart home

  • each device produces a new data sample at a rate that depends on the environment and the user (continuous, event-based, on demand...)
  • a device should only transmit when it has new data
    • but in backscattering-based networks they need to be queried by the receiver

In which order should the reader query tags?

  • assume prefixed time slots
  • TDMA with random access performs poorly
  • TDMA with fixed assignment also performs poorly (wasted queries)
  • we want to query devices that have new data samples and avoid
    • data loss
    • redundant queries

Goal: design a MAC protocol that adapts to all of this. One possibility is to use Reinforcement Learning.

Reinforcement learning

How can an intelligent agent learn to make a good sequence of decisions?

  • an agent can figure out how the world works by trying things and seeing what happens
  • this is what people and animals do
  • we explore a computational approach to learning from interaction
    • goal-directed learning from interaction

RL is learning what to do; it has two main characteristics:

  • trial-and-error search

  • delayed reward

Sensation, action and goal are the 3 main aspects of a reinforcement learning method:

  • a learning agent must be able to

    • sense the state of the environment
    • take actions that affect the state

Difference from other ML

  • no supervisor
  • feedback may be delayed
  • time matters
  • the agent's actions affect future decisions
  • a sequence of successful decisions will result in the process being reinforced
  • RL learns online

Learning online

  • learning while interacting with an ever changing world
  • we expect agents to get things wrong, to refine their understanding as they go
  • the world is not static, agents continuously encounter new situations

Rewards

  • a reward is a scalar feedback signal (a number)
  • reward Rt indicates how well the agent is doing at step t
  • the agent should maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
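As a minimal worked form of the hypothesis (assuming a discount factor \gamma \in [0, 1], which is not stated at this point in the notes), maximizing expected cumulative reward means choosing actions that maximize the expected return:

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$

The value function introduced further down is the expectation of this return from a given state.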

Communication in battery-free environments

  • positive reward if the queried device has new data
  • negative reward otherwise

Challenges:

  • tradeoff between exploration and exploitation
  • to obtain a lot of reward, an RL agent must prefer actions that it tried in the past and found effective (exploitation)
  • but better actions may exist... so the agent also has to explore!

Exploration vs exploitation dilemma (a minimal sketch follows this list):

  • comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
  • exploitation: we take advantage of the best option we know
  • exploration: we test new decisions
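One common way to balance the two (not specific to these notes) is ε-greedy action selection. A minimal sketch in Python, where `q_estimates` and `epsilon` are illustrative names:

```python
import random

def epsilon_greedy(q_estimates, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, otherwise exploit.

    q_estimates: list of current value estimates, one per action (illustrative).
    """
    if random.random() < epsilon:
        # exploration: test a uniformly random action
        return random.randrange(len(q_estimates))
    # exploitation: take advantage of the best option we currently know
    return max(range(len(q_estimates)), key=lambda a: q_estimates[a])
```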

A general RL framework

at each time step the agent:

  • executes action At
  • receives observation Ot
  • receives scalar reward Rt

the environment:

  • receives action At
  • emits observation Ot
  • emits scalar reward Rt

agent state: the agent's view of the environment state; it is a function of the history (a minimal interaction loop is sketched after this list)

  • the history is involved in taking the next decision:
    • agent selects actions
    • environment selects observations/rewards
  • the state information is used to determine what happens next
    • state is a function of history: S_t = f(H_t)
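A minimal sketch of the interaction loop described above; the `RandomAgent` and `ToyEnvironment` classes are placeholder assumptions, not part of the notes:

```python
import random

class RandomAgent:
    """Placeholder agent: picks a random action and keeps the history."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
    def act(self, observation):
        return random.randrange(self.n_actions)   # executes action A_t
    def update(self, history):
        pass                                      # S_t = f(H_t) would be computed here

class ToyEnvironment:
    """Placeholder environment: rewards action 0, ignores everything else."""
    def reset(self):
        return 0                                  # initial observation
    def step(self, action):
        observation = random.randrange(2)         # emits observation O_t
        reward = 1.0 if action == 0 else -1.0     # emits scalar reward R_t
        return observation, reward

def run_episode(agent, env, num_steps=100):
    history = []                                  # H_t: sequence of (A, O, R) triples
    observation = env.reset()
    for t in range(num_steps):
        action = agent.act(observation)                # agent executes A_t
        observation, reward = env.step(action)         # environment emits O_t, R_t
        history.append((action, observation, reward))
        agent.update(history)                          # agent state from history
    return history

history = run_episode(RandomAgent(n_actions=2), ToyEnvironment())
```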

Inside the agent

one or more of these components

  • Policy: agent's behavior function
    • defines what to do (behavior at a given time)
    • maps state to action
    • core of the RL agent
    • the policy is altered based on the reward
    • may be
      • deterministic: single function of the state
      • stochastic: specifying probabilities for each action
        • reward changes probabilities
  • Value function:
    • specifies what's good in the long run
    • is a prediction of future reward
    • used to evaluate the goodness/badness of states
    • values are predictions of reward
    • V_\pi(s) = E_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... | S_t = s]
      • better explained later (a small code sketch of the policy and value function follows this list)
  • Model:
    • predicts what the environment will do next
    • may predict the resultant next state and/or the next reward
    • many problems are model free
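A small sketch of the first two components in code, tabular and dictionary-based; the states, actions and probabilities are illustrative assumptions:

```python
import random

# Stochastic policy: for each state, a probability for each action.
# (A deterministic policy would instead map each state to a single action.)
policy = {
    "s0": {"query_dev_0": 0.5, "query_dev_1": 0.5},
    "s1": {"query_dev_0": 0.9, "query_dev_1": 0.1},
}

# Value function V(s): prediction of future (discounted) reward from state s.
value = {"s0": 0.0, "s1": 0.0}

def sample_action(state):
    """Draw an action according to the policy's probabilities for this state."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action("s0")
```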

Back to the original problem:

  • n devices
  • each device i produces new data with rate rate_i
  • in which order should the reader query tags?
  • formulate it as an RL problem (a toy sketch follows this list)
    • the agent is the reader
    • one action per device (query device i)
    • rewards:
      • positive when querying a device with new data
      • negative if it has no data
      • what to do if the device has lost data?
    • state?
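A toy sketch of this formulation, treating the reader as a simple value-estimating agent with one action per device; the simulated rates, the +1/-1 rewards and the ε-greedy selection are illustrative assumptions, and the sketch ignores both the state question and the lost-data case:

```python
import random

# Simulated tags: device i generates a new sample each slot with probability rates[i]
# (rates are illustrative; in the real system they depend on environment and user).
rates = [0.8, 0.3, 0.1, 0.5]
has_data = [False] * len(rates)

def env_step(device):
    """Generate new data, answer the reader's query, and return the reward."""
    for i, r in enumerate(rates):
        if random.random() < r:
            has_data[i] = True
    reward = 1.0 if has_data[device] else -1.0   # new data -> positive, else negative
    has_data[device] = False                     # queried data is transmitted
    return reward

def choose_device(q, epsilon=0.1):
    """Epsilon-greedy choice of which tag to query next."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda d: q[d])

q = [0.0] * len(rates)                 # one action per device: "query device i"
for step in range(5000):
    device = choose_device(q)
    reward = env_step(device)
    q[device] += 0.1 * (reward - q[device])   # running value estimate per action
```

A fuller formulation would also define a state (for example, the time since each device was last queried), so that the agent can learn when to query each device, not just which devices pay off on average.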