RL foundations
- input : interaction with environment
- goal : learn a policy (mapping states to actions), that maximizes some utility function.
- policy can be deterministic \pi : S \to A
-
policy can be stochastic \pi(a s) = P(a s) - good policy - when action is beneficial in long run
- value function : Q(s,a) : chances that we will win is higher ==> higher value function. cumulative rewards in long term
Difference Between Reward and Value
In reinforcement learning:
- Reward is the immediate feedback from the environment after an action at time (t), usually written as (r_t).
- Value is the expected long-term return (future cumulative rewards), usually from a state or state-action pair.
Common forms:
- State value:
[
V^\pi(s)=\mathbb{E}\pi\left[\sum{k=0}^\infty \gamma^k r_{t+k+1}\mid s_t=s\right]
] - Action value (Q-value):
[
Q^\pi(s,a)=\mathbb{E}\pi\left[\sum{k=0}^\infty \gamma^k r_{t+k+1}\mid s_t=s,a_t=a\right]
]
So: reward is a one-step signal, while value is the predicted total future reward.
- agent - anything and everything that has absolute control.
- environment - everything else.
-
$\pi(a s)$ - policy, probability of taking action a in state s. -
environment - P(s a) - gives the next state and reward. - terminal state - win, lose, draw. for tic tac toc
- table is the value function - state to value.
- exploration and exploitation - interleave. - is heart of action selection.