master-degree-notes/Autonomous Networking/notes/6.1 RL.md

Case study: battery-free smart home

  • each device produces new data samples at a rate that depends on the environment and the user (continuously, event-based, on demand...)
  • a device should only transmit when it has new data
    • but in backscatter-based networks, devices must be queried by the receiver

In which order should the reader query tags?

  • assume pre-defined timeslots
  • TDMA with random access performs poorly
  • TDMA with fixed assignment also performs poorly (wasted queries)
  • we want to query devices that have new data samples and avoid
    • data loss
    • redundant queries

Goal: design a MAC protocol that adapts to all of this. One possibility is to use Reinforcement Learning.

Reinforcement learning

How can an intelligent agent learn to make a good sequence of decisions?

  • an agent can figure out how the world works by trying things and seeing what happens
  • this is what people and animals do
  • we explore a computational approach to learning from interaction
    • goal-directed learning from interaction

RL is learning what to do; it has two main characteristics:

  • trial-and-error search

  • delayed reward

  • sensation, action, and goal are the 3 main aspects of a reinforcement learning method

  • a learning agent must be able to

    • sense the state of the environment
    • take actions that affect the state

Differences from other ML

  • no supervisor
  • feedback may be delayed
  • time matters
  • agent action affects future decisions
  • ...
  • online learning

Learning online

  • learning while interacting with an ever-changing world
  • we expect agents to get things wrong, to refine their understanding as they go
  • the world is not static, agents continuously encounter new situations

RL applications:

  • self driving cars
  • engineering
  • healthcare
  • news recommendation
  • ...

Rewards

  • a reward is a scalar feedback signal (a number)
  • reward R_t indicates how well the agent is doing at step t
  • the agent should maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
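The "expected cumulative reward" is the discounted return. A minimal sketch of how it is computed (the reward values and the discount factor gamma below are made-up illustration numbers, not from the lecture):

```python
# Minimal sketch: cumulative (discounted) reward, the quantity an RL agent maximizes.
# The reward sequence and the discount factor gamma are made-up illustration values.

def discounted_return(rewards, gamma=0.9):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards observed after some time step t
print(discounted_return([1, -1, 1, 1]))  # 1 - 0.9 + 0.81 + 0.729 = 1.639
```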

Communication in battery-free environments:

  • positive reward if the queried device has new data
  • negative reward otherwise

Challenge:

  • tradeoff between exploration and exploitation
  • to obtain a lot of reward, an RL agent must prefer actions that it has tried in the past and found effective in producing reward
  • but to discover such actions, it has to try actions it has not selected before

exploration vs exploitation dilemma:

  • comes from incomplete information: we need to gather enough information to make the best overall decisions while keeping the risk under control
  • exploitation: we take advantage of the best option we know
  • exploration: we test new decisions
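A common recipe to balance the two is ε-greedy action selection: with a small probability ε the agent explores a random action, otherwise it exploits the action with the highest estimated value. A minimal sketch (the number of actions, ε, and the running-average value estimates are illustrative choices, not from the lecture):

```python
import random

# Minimal epsilon-greedy sketch: value estimates are kept as running averages.
# n_actions and epsilon are illustrative values, not from the lecture.
n_actions = 4
epsilon = 0.1
q = [0.0] * n_actions   # estimated value of each action
n = [0] * n_actions     # how many times each action has been tried

def select_action():
    if random.random() < epsilon:
        return random.randrange(n_actions)               # explore: random action
    return max(range(n_actions), key=lambda a: q[a])     # exploit: best estimated action

def update(action, reward):
    n[action] += 1
    q[action] += (reward - q[action]) / n[action]        # incremental average update
```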

A general RL framework

at each timestep the agent

  • executes action
  • receives observation
  • receives scalar reward

the environment receives the action, then emits the next observation and a scalar reward (see the interaction-loop sketch below)

agent state: the agent's view of the environment state; it is a function of the history

  • this function of the history is the information used to take the next decision
  • the state representation defines what happens next
  • ...
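A minimal sketch of this interaction loop; the env/agent interfaces and `n_steps` here are hypothetical placeholders, not an API from the lecture:

```python
# Minimal sketch of the agent-environment loop: at each timestep the agent executes
# an action, then receives an observation and a scalar reward from the environment.
# The env/agent interfaces and n_steps are hypothetical placeholders.

def run_episode(env, agent, n_steps=100):
    history = []                                    # history of (observation, action, reward)
    observation = env.reset()
    for t in range(n_steps):
        action = agent.act(observation)             # agent executes an action
        observation, reward = env.step(action)      # environment emits observation + scalar reward
        agent.observe(observation, reward)          # agent updates its state (a function of the history)
        history.append((observation, action, reward))
    return history
```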

Inside the agent

one or more of these components

  • Policy: agent's behavior function
    • defines what to do (behavior at a given time)
    • maps state to action
    • core of the RL agent
    • the policy is altered based on the reward
    • may be
      • deterministic: single function of the state
      • stochastic: specifies a probability for each action
        • the reward changes the probabilities
  • Value function (see the sketch after this list):
    • specifies what's good in the long run
    • is a prediction of future reward
    • used to evaluate the goodness/badness of states
    • values are predictions of rewards
    • $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$
  • Model:
    • predicts what the environment will do next
    • many problems are model free
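A minimal sketch of the policy and value-function components as plain lookup tables, with made-up states, actions, and numbers (none of them come from the lecture):

```python
import random

# Illustrative states, actions, and numbers only; not from the lecture.
states = ["s0", "s1"]
actions = ["query_A", "query_B"]

# Deterministic policy: a single function (here a lookup table) from state to action.
deterministic_policy = {"s0": "query_A", "s1": "query_B"}

# Stochastic policy: a probability for each action in each state;
# the reward signal would be used to shift these probabilities.
stochastic_policy = {
    "s0": {"query_A": 0.8, "query_B": 0.2},
    "s1": {"query_A": 0.3, "query_B": 0.7},
}

def sample_action(state):
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

# Value function: predicted future (discounted) reward from each state.
value_function = {"s0": 0.6, "s1": -0.1}
```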

back to the original problem:

  • n devices
  • each device i produces new data with rate rate_i
  • in which order should the reader query tags?
  • formulate as an RL problem
    • the agent is the reader
    • one action per device (query)
    • rewards:
      • positive when querying a device with new data
      • negative if it has no data
      • what to do if the device has lost data?
    • state?
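A toy sketch of this formulation, assuming each device i generates a new sample in a slot with probability rate_i; the class name, the ±1 rewards, and the "slots since last query" observation are illustrative assumptions (the "state?" question above is left open in the notes):

```python
import random

# Toy sketch of the reader-vs-tags problem as an RL environment.
# The rates, the +1/-1 rewards, and the observation choice are illustrative assumptions.

class BatteryFreeNetwork:
    def __init__(self, rates):
        self.rates = rates                          # rate_i: prob. device i gets new data per slot
        self.has_data = [False] * len(rates)
        self.slots_since_query = [0] * len(rates)

    def step(self, action):
        """The reader queries device `action`; returns (observation, reward)."""
        reward = 1.0 if self.has_data[action] else -1.0   # new data -> positive, no data -> negative
        self.has_data[action] = False
        for i, rate in enumerate(self.rates):
            self.slots_since_query[i] += 1
            if random.random() < rate:
                # if an unread sample was already pending, it is overwritten (data loss)
                self.has_data[i] = True
        self.slots_since_query[action] = 0
        # observation: one possible answer to the open "state?" question above,
        # namely how many slots ago each device was last queried
        return tuple(self.slots_since_query), reward

env = BatteryFreeNetwork(rates=[0.1, 0.5, 0.9])
obs, reward = env.step(action=2)                    # one action per device: query device 2
```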