vault backup: 2024-11-06 22:02:11

Marco Realacci 2024-11-06 22:02:11 +01:00
commit fb9be058bd
21 changed files with 170 additions and 23 deletions


@@ -51,4 +51,64 @@ Every time a collision is generated, tags randomly increment their counter. The
Since the tags are split into two sets at each step, the process can be seen as a binary tree, so we can count the nodes of the tree to get an estimate.
$$BS_{tot}(n)=\begin{cases}1, & n\le1\\ 1+\sum_{k=0}^{n}\binom{n}{k}\left(\frac{1}{2}\right)^{k}\left(1-\frac{1}{2}\right)^{n-k}\left(BS_{tot}(k)+BS_{tot}(n-k)\right), & n>1\end{cases}$$
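A minimal Python sketch of this recursion (the function name `bs_tot` and the memoisation are mine; $BS_{tot}(0)=1$ is assumed from the $n\le1$ case). Note that the $k=0$ and $k=n$ terms of the sum contain $BS_{tot}(n)$ itself, so the code solves for it before recursing:
```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def bs_tot(n: int) -> float:
    """Expected total number of slots needed to resolve n colliding tags."""
    if n <= 1:
        return 1.0
    # Probability that all n tags end up in the same subset (k = 0 or k = n).
    p_edge = 2 * 0.5 ** n
    # The k = 0 and k = n terms contain bs_tot(n) itself; moving them to the
    # left-hand side gives: bs_tot(n) * (1 - p_edge) = 1 + p_edge + inner_sum.
    inner_sum = sum(
        comb(n, k) * 0.5 ** n * (bs_tot(k) + bs_tot(n - k))
        for k in range(1, n)
    )
    return (1 + p_edge + inner_sum) / (1 - p_edge)

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(bs_tot(n), 2))
```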
#### Q: Explain the differences between proactive and reactive routing in sensor networks. Discuss the advantages and disadvantages
#### Q: Define the agent state and the environment state and explain how these two states differ. Give a practical example
The environment state is the actual state of the environment: the full description of the current situation. It contains everything related to the environment, regardless of whether the agent is able to observe it; often only a small part of the environment is observable by the agent.
The agent state is the agent's view of the environment, and it is a function of the history: $S_{t} = f(H_{t})$
The agent state is what the policy uses to take the next decision, based on the history.
The distinction is important because the agent has to learn to make good decisions with limited information. For example, for a robot exploring a building, the environment state includes the exact position of every object and person in the building, while the agent state only contains what the robot has observed through its sensors so far.
#### Q: Explain the exploitation-exploration dilemma
The exploitation/exploration dilemma is the problem of finding the best compromise between the two: an agent wants to exploit actions that are known to bring positive rewards, but without exploring it may never learn which actions are actually the best, so it also wants to explore. However, if the agent explores too much, it may choose non-optimal actions too many times.
#### Q: Mention and briefly explain three different strategies for action selection in reinforcement learning
**Greedy:** the agent always exploits the action with the highest estimated action value.
**$\epsilon$-greedy:** the greedy action is selected with probability $1-\epsilon$, while with probability $\epsilon$ a random action is selected. This keeps the agent exploring and helps it find the actions with the best values.
**UCB:** this method is based on the "optimism in the face of uncertainty" principle: if we are unsure about an action, we optimistically assume it could be good. Actions are therefore chosen not only on the basis of their estimated value, but also on the uncertainty of that estimate. For each action the agent maintains a confidence interval in which it believes the true mean reward lies; to explore more the actions it is unsure about, the interval's upper bound is optimistically used as the action value.
*rephrase this better, see the slides*
**Higher initial values:** the action-value estimates are initialized optimistically, to values higher than any realistic reward. Every action then looks promising at first, so the agent is driven to try all of them before the estimates settle to their true values, which encourages exploration early on (a code sketch of these selection rules follows the list).
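A minimal Python sketch of $\epsilon$-greedy and UCB selection (the function names and the exploration constant `c` are my own illustrative choices, not from the notes):
```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # pure greedy selection corresponds to epsilon_greedy with epsilon = 0
    return max(range(len(q_values)), key=lambda a: q_values[a])

def ucb(q_values, counts, t, c=2.0):
    """Pick the action with the highest upper confidence bound Q(a) + c*sqrt(ln t / N(a))."""
    # An action never tried has unbounded uncertainty, so try it first.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(
        range(len(q_values)),
        key=lambda a: q_values[a] + c * math.sqrt(math.log(t) / counts[a]),
    )
```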
#### Q: Assume $\lambda=0.5$ and the following sequence of rewards is received
$R_{1}=-1$
$R_{2}=2$
$R_{3}=6$
$R_{4}=3$
$R_{5}=2$
with $T=5$. What are $G_{0}, G_{1}, \dots, G_{5}$?
*Hint: work backwards using $G_{t} = R_{t+1} + \lambda G_{t+1}$, starting from $G_{T} = 0$*
$G_{5} = 0$
$G_{4} = 2$
$G_{3} = 3 + \frac{1}{2} 2 = 4$
$G_{2}=6+\frac{1}{2}4 = 8$
$G_{1}=2+\frac{1}{2}8 = 6$
$G_{0}=-1+\frac{1}{2}6=2$
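A quick Python check of the backward computation (the helper name `returns` is my own):
```python
def returns(rewards, discount=0.5):
    """Compute G_0..G_T backwards from G_T = 0, using G_t = R_{t+1} + discount * G_{t+1}."""
    g = 0.0
    out = [0.0]                      # G_T
    for r in reversed(rewards):      # R_T, ..., R_1
        g = r + discount * g
        out.append(g)
    return list(reversed(out))       # [G_0, ..., G_T]

print(returns([-1, 2, 6, 3, 2]))     # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```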
#### Q: Imagine a network of 10 sensor nodes deployed across an area to monitor environmental conditions, such as temperature, humidity, or pollutant levels. Each sensor node has a different, but unknown, data quality score and a battery level that fluctuates due to environmental factors and usage over time. Your goal is to design a strategy that balances exploration and exploitation to maximize cumulative data quality while conserving battery resources.
Actions = {query sensor 1, ..., query sensor 10}
The reward should be a function of data quality and battery level. We model data quality $dq$ and battery level $bl$ as floating-point values between 0 and 1.
$R_{t}=\alpha \cdot dq-\beta \cdot bl$, with $\alpha$ and $\beta$ being arbitrary parameters that can be set to weigh the importance of data quality and battery level.
States: a single state (the problem is treated as a multi-armed bandit)
Agent: sample-average action-value estimates
Agent's policy: $\epsilon$-greedy
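A minimal sketch of such an agent (the class name, $\epsilon$, and the default $\alpha$, $\beta$ weights are my own illustrative choices; the reward follows the formula above):
```python
import random

class SensorBandit:
    """Sample-average, epsilon-greedy agent choosing which of 10 sensors to query."""
    def __init__(self, n_sensors=10, epsilon=0.1):
        self.epsilon = epsilon
        self.q = [0.0] * n_sensors   # estimated value of querying each sensor
        self.n = [0] * n_sensors     # how many times each sensor was queried

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))
        return max(range(len(self.q)), key=lambda a: self.q[a])

    def update(self, action, dq, bl, alpha=1.0, beta=0.5):
        # Reward as in the notes: R = alpha * dq - beta * bl
        r = alpha * dq - beta * bl
        self.n[action] += 1
        # Incremental sample-average update: Q <- Q + (R - Q) / N
        self.q[action] += (r - self.q[action]) / self.n[action]

# usage: pick a sensor, observe its data quality and battery level, update
agent = SensorBandit()
a = agent.select()
agent.update(a, dq=0.8, bl=0.3)
```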
#### Q: Explain the Bellman Expectation Equation for the value of a state $V^\pi$
Basic principle: the value of a state is the expected immediate reward obtained when leaving that state, plus the discounted value of the successor states.
*add backup diagram*
$$v_{\pi}(s)=\sum_{a \in A}\pi(a|s)q_{\pi}(s,a)=\sum_{a \in A}\pi(a|s)\left( R_{s}^a+\gamma \sum_{s' \in S}P_{ss'}^a v_{\pi}(s') \right)$$
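A minimal sketch of one Bellman-expectation backup in Python (the nested-dict layout for $\pi$, $R_{s}^a$, $P_{ss'}^a$ and the toy two-state numbers are my own assumptions, not from the notes; iterating the backup to a fixed point is iterative policy evaluation):
```python
def bellman_backup(policy, R, P, gamma, v):
    """One Bellman-expectation backup:
    v_pi(s) = sum_a pi(a|s) * (R[s][a] + gamma * sum_s' P[s][a][s'] * v[s'])."""
    return {
        s: sum(
            policy[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in P[s][a]))
            for a in policy[s]
        )
        for s in policy
    }

# Tiny two-state example (numbers are illustrative only).
policy = {"s0": {"stay": 0.5, "go": 0.5}, "s1": {"stay": 1.0}}
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0}}
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}}, "s1": {"stay": {"s1": 1.0}}}
v = {"s0": 0.0, "s1": 0.0}
for _ in range(50):          # repeated backups converge towards v_pi
    v = bellman_backup(policy, R, P, 0.9, v)
print(v)
```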