Samrat Kar

exploring & experimenting

Multi-Armed Bandit (MAB) Notes

Multi-Armed Bandit Problem - MAB

Key Points

  1. There is a single state; the state never changes, so state transitions play no role.
  2. The agent chooses among $k$ different actions (arms).
  3. After each action the agent receives a numerical reward drawn from a stationary probability distribution associated with that action.
  4. Each of the $k$ actions has an expected (mean) reward given that it is selected. This is known as the value of the action, or action-value, since the reward is drawn from the probability distribution associated with that action.
  5. The action selected on time step $t$ is denoted $A_t$, and the corresponding reward is $R_t$.
  6. The value of an arbitrary action $a$ is denoted by $q_*(a)$. It is the expected reward given that action $a$ is selected.
    \(q_*(a) = \mathbb{E}[R_t | A_t = a]\)
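The definition $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$ can be checked empirically: if we repeatedly pull one arm, the average observed reward converges to its action-value. A minimal sketch, assuming a hypothetical Bernoulli arm (reward $1$ with probability $p$, else $0$, so $q_*(a) = p$):

```python
import random

random.seed(0)  # fixed seed for a reproducible run

# Hypothetical Bernoulli arm: reward 1 with probability p, else 0.
# Its true action-value is q_*(a) = p.
def pull(p):
    return 1 if random.random() < p else 0

p = 0.7
rewards = [pull(p) for _ in range(100_000)]

# Empirical mean reward approaches the true action-value q_*(a) = 0.7.
estimate = sum(rewards) / len(rewards)
print(round(estimate, 2))
```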

Examples

  • Suppose the action set is $\mathcal{A} = \{1,2,3\}$.
    If action $1$ gives reward $1$ with probability $0.7$ and reward $0$ with probability $0.3$, then
    \(\mathbb{P}(R_t = 1 \mid A_t = 1) = 0.7, \qquad \mathbb{P}(R_t = 0 \mid A_t = 1) = 0.3\)
    \(q_*(1) = \mathbb{E}[R_t \mid A_t = 1] = 1(0.7) + 0(0.3) = 0.7\)

  • Suppose the action set is $\mathcal{A} = \{a,b\}$.
    If action $a$ gives reward $+2$ with probability $0.4$ and reward $-1$ with probability $0.6$, then
    \(q_*(a) = \mathbb{E}[R_t \mid A_t = a] = 2(0.4) + (-1)(0.6) = 0.2\)

  • Suppose the action set is $\mathcal{A} = \{1,2\}$.
    If action $1$ always gives reward $5$, and action $2$ gives reward $9$ with probability $0.5$ and reward $1$ with probability $0.5$, then
    \(q_*(1) = 5\)
    \(q_*(2) = 9(0.5) + 1(0.5) = 5\)

    Even though the reward distributions are different, both actions have the same action-value.

  • The reward distribution of each action is private to the bandit; the agent only observes the sampled rewards $R_t$. The action-value $q_*(a)$ itself is always a scalar quantity.
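The expected values in the examples above can be computed directly as $\sum_r r \cdot \mathbb{P}(R_t = r \mid A_t = a)$. A minimal sketch, with `q_star` as an illustrative helper name (not from the text) taking a list of `(reward, probability)` pairs:

```python
# Action-value of a discrete reward distribution:
# q_*(a) = sum over outcomes of reward * probability.
def q_star(dist):
    return sum(r * p for r, p in dist)

print(q_star([(1, 0.7), (0, 0.3)]))   # example 1: expected value 0.7
print(q_star([(2, 0.4), (-1, 0.6)]))  # example 2: expected value 0.2
print(q_star([(5, 1.0)]))             # example 3, action 1: expected value 5
print(q_star([(9, 0.5), (1, 0.5)]))   # example 3, action 2: expected value 5
```

Note that the last two distributions differ yet yield the same action-value, matching the observation above.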

Key Points

  1. If the action-value $q_*(a)$ of every action were known to the agent, solving the k-armed bandit problem would be trivial: always select the action with the highest value.
  2. By design, the probability distribution of the rewards for a given action is not known to the agent. So the agent can never determine $q_*(a)$ exactly for any action $a$.
  3. However, by trial and error, the agent can estimate the action-value of action $a$ at time step $t$, denoted $Q_t(a)$.

$Q_t(a)$ is the agent's estimate at time $t$ of the true action-value $q_*(a)$.

One common choice is the sample-average estimate:

\[Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \, \mathbf{1}\{A_i = a\}}{\sum_{i=1}^{t-1} \mathbf{1}\{A_i = a\}}\]

Here, $\mathbf{1}\{A_i = a\}$ is an indicator function:

\[\mathbf{1}\{A_i = a\} = \begin{cases} 1, & \text{if } A_i = a \\ 0, & \text{if } A_i \ne a \end{cases}\]

So it keeps only those rewards for which action $a$ was actually selected.

Example:

If
\(A_1 = 1, \; A_2 = 2, \; A_3 = 1, \; A_4 = 3\)
and
\(R_1 = 5, \; R_2 = 2, \; R_3 = 7, \; R_4 = 1\)
then for action $a = 1$,
\(\mathbf{1}\{A_1 = 1\} = 1, \; \mathbf{1}\{A_2 = 1\} = 0, \; \mathbf{1}\{A_3 = 1\} = 1, \; \mathbf{1}\{A_4 = 1\} = 0\)

Here we are computing the estimate for action $a = 1$.

In $\mathbf{1}\{A_1 = 1\}$:

  • the $1$ inside the braces is the action label
  • the $\mathbf{1}\{\cdot\}$ outside is the indicator function

So $\mathbf{1}\{A_1 = 1\} = 1$ means: at time step $1$, the chosen action was action $1$, so the condition is true, and the indicator returns $1$.

Hence, with $t = 5$ (all four steps observed),
\(Q_5(1) = \frac{5(1) + 2(0) + 7(1) + 1(0)}{1 + 0 + 1 + 0} = \frac{12}{2} = 6\)
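The sample-average estimate above is just the mean of the rewards received on the steps where action $a$ was selected. A minimal sketch reproducing the worked example, with `sample_average` as an illustrative name of my own (not from the text):

```python
# Sample-average estimate Q_t(a): mean reward over the steps
# where action a was actually selected (indicator keeps only those).
def sample_average(actions, rewards, a):
    picked = [r for act, r in zip(actions, rewards) if act == a]
    return sum(picked) / len(picked) if picked else 0.0

actions = [1, 2, 1, 3]  # A_1..A_4 from the example
rewards = [5, 2, 7, 1]  # R_1..R_4 from the example
print(sample_average(actions, rewards, 1))  # (5 + 7) / 2 = 6.0
```

If action $a$ has never been selected, the sum in the denominator is zero; the sketch returns a default of $0.0$ in that case, though any default initial estimate could be used.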