
Samrat Kar


Deep Reinforcement Learning (DRL) - Big Picture

Part 1 - Tabular Solution Methods

  1. Value functions are represented as arrays or tables.
  2. The single-state Multi-Armed Bandit (MAB) problem.
  3. Markov Decision Process (MDP) - states, actions, transition probabilities, rewards; Bellman equations and value functions.
  4. Three fundamental classes of methods for solving MDPs -
    • Dynamic Programming (DP) - requires a complete and accurate model of the environment’s dynamics (transition probabilities and rewards). It uses this model to compute optimal policies through methods like value iteration and policy iteration.
    • Monte Carlo (MC) methods - do not require a model of the environment. They learn from experience by sampling episodes of interaction with the environment and using the observed rewards to estimate value functions.
    • Temporal Difference (TD) learning - TD methods also do not require a model of the environment. They learn from experience by updating value estimates based on the observed reward and the estimated value of the next state, without waiting for the end of an episode.
  5. MAB

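The TD update described above can be made concrete with a tiny sketch. This is an illustrative toy, not code from these notes: the two-state chain and the values of `alpha` and `gamma` are assumptions chosen for the example.

```python
# TD(0) value update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).
# It needs neither a model of the environment nor a completed episode --
# a single observed transition (s, r, s') is enough to update V(s).
V = {"s0": 0.0, "s1": 0.0}   # tabular value function (Part 1: arrays/tables)
alpha, gamma = 0.5, 1.0      # illustrative step size and discount

# One observed transition: from s0, reward 1.0, arriving in s1.
s, r, s_next = "s0", 1.0, "s1"
V[s] += alpha * (r + gamma * V[s_next] - V[s])
print(V["s0"])  # 0.5
```

Note that the update happened immediately after one step, in contrast to a Monte Carlo method, which would wait for the episode to end before updating.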

Multi-Armed Bandit Problem - MAB

Key Points

  1. There is only a single state, so there is no notion of state change and no discussion of state.
  2. The agent has a choice of k different actions or options.
  3. After each action the agent receives a numerical reward drawn from a stationary probability distribution associated with that action.
  4. Each of the k actions has an expected or mean reward given that the action is selected. This is known as the value of the action, or action-value, since the reward is drawn from a probability distribution associated with that action.
  5. The action selected on time step $t$ is denoted $A_t$, and the corresponding reward is $R_t$.
  6. The value of an arbitrary action $a$ is denoted by $q_*(a)$. It is the expected reward given that action $a$ is selected:
    \(q_*(a) = \mathbb{E}[R_t \mid A_t = a]\)

Examples

  • Suppose the action set is $\mathcal{A} = \{1,2,3\}$.
    If action $1$ gives reward $1$ with probability $0.7$ and reward $0$ with probability $0.3$, then
    \(\mathbb{P}(R_t = 1 \mid A_t = 1) = 0.7, \qquad \mathbb{P}(R_t = 0 \mid A_t = 1) = 0.3\)
    \(q_*(1) = \mathbb{E}[R_t \mid A_t = 1] = 1(0.7) + 0(0.3) = 0.7\)

  • Suppose the action set is $\mathcal{A} = \{a,b\}$.
    If action $a$ gives reward $+2$ with probability $0.4$ and reward $-1$ with probability $0.6$, then
    \(q_*(a) = \mathbb{E}[R_t \mid A_t = a] = 2(0.4) + (-1)(0.6) = 0.2\)

  • Suppose the action set is $\mathcal{A} = \{1,2\}$.
    If action $1$ always gives reward $5$, and action $2$ gives reward $9$ with probability $0.5$ and reward $1$ with probability $0.5$, then
    \(q_*(1) = 5\)
    \(q_*(2) = 9(0.5) + 1(0.5) = 5\)

    Even though the reward distributions are different, both actions have the same action-value.

  • The reward $R_t$ is a sample drawn from a probability distribution that is private to the bandit; the agent never observes the distribution itself. The action-value $q_*(a)$, by contrast, is always a scalar quantity.
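The separation between the private reward distribution and the scalar action-value can be sketched with a minimal bandit arm. The `BernoulliArm` class and its parameter values are illustrative assumptions, not part of these notes:

```python
import random

class BernoulliArm:
    """One arm of a stationary bandit. Its distribution is private:
    the agent only ever sees scalar sampled rewards from pull()."""

    def __init__(self, p, reward=1.0):
        self.p = p              # probability of the positive reward (hidden)
        self.reward = reward    # magnitude of the positive reward (hidden)

    def pull(self):
        # What the agent observes: a single sampled scalar reward R_t.
        return self.reward if random.random() < self.p else 0.0

    def q_star(self):
        # The true action-value: a scalar expectation, not a distribution.
        return self.p * self.reward

arm = BernoulliArm(p=0.7)       # matches the first example: q_*(1) = 0.7
```

Averaging many calls to `pull()` converges toward `q_star()`, which is exactly why sample-average estimation (below) works.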

Key points

  1. If the action-value $q_*(a)$ were known to the agent for every action of a MAB, solving the k-armed bandit problem would be trivial: always select the action with the highest value.
  2. By design, the probability distribution of rewards for a given action is not known to the agent. So the agent can never determine $q_*(a)$ for any action $a$ with certainty.
  3. However, by trial and error, the agent can estimate the action-value of action $a$ at time step $t$, denoted $Q_t(a)$.
\[Q_t(a)\]

It is the agent’s estimate at time $t$ of the true action-value $q_*(a)$.

One common choice is the sample-average estimate:

\[Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \, \mathbf{1}\{A_i = a\}}{\sum_{i=1}^{t-1} \mathbf{1}\{A_i = a\}}\]

Here, $\mathbf{1}\{A_i = a\}$ is an indicator function:

\[\mathbf{1}\{A_i = a\} = \begin{cases} 1, & \text{if } A_i = a \\ 0, & \text{if } A_i \ne a \end{cases}\]

So it keeps only those rewards for which action $a$ was actually selected.

Example:

If
\(A_1 = 1, \; A_2 = 2, \; A_3 = 1, \; A_4 = 3\)
and
\(R_1 = 5, \; R_2 = 2, \; R_3 = 7, \; R_4 = 1\)
then for action $a = 1$,
\(\mathbf{1}\{A_1 = 1\} = 1, \; \mathbf{1}\{A_2 = 1\} = 0, \; \mathbf{1}\{A_3 = 1\} = 1, \; \mathbf{1}\{A_4 = 1\} = 0\)

Here we are computing the estimate for action $a = 1$.

In $\mathbf{1}\{A_1 = 1\}$:

  • the $1$ inside the braces is the action label
  • the $\mathbf{1}\{\cdot\}$ outside is the indicator function

So $\mathbf{1}\{A_1 = 1\} = 1$ means: at time step $1$, the chosen action was action $1$, so the condition is true, and the indicator returns $1$.

Hence, with $t = 5$ (so the sums run over steps $1$ through $4$),
\(Q_5(1) = \frac{5(1) + 2(0) + 7(1) + 1(0)}{1 + 0 + 1 + 0} = \frac{12}{2} = 6\)
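The worked example above translates directly into code. This is a minimal sketch of the sample-average estimate; the function name `sample_average` is my own label, not from the notes:

```python
# Sample-average estimate Q_t(a), using the data from the worked example.
actions = [1, 2, 1, 3]   # A_1, A_2, A_3, A_4
rewards = [5, 2, 7, 1]   # R_1, R_2, R_3, R_4

def sample_average(a, actions, rewards):
    """Average of the rewards received on steps where action a was chosen,
    i.e. the indicator-weighted sum divided by the indicator count."""
    picked = [r for A, r in zip(actions, rewards) if A == a]
    return sum(picked) / len(picked) if picked else 0.0

print(sample_average(1, actions, rewards))  # → 6.0
```

The guard for an empty `picked` list handles actions that have not yet been tried, where the sample-average estimate is undefined and some default (here 0.0) must be chosen.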