Samrat Kar

exploring & experimenting

Reinforcement Learning Q&A Notes

What does line 12 in test.py do?

In Network.forward, line 12 is return self.linear(x). It applies a learned linear transformation to the input tensor x and returns the raw output (logits), with no activation applied.

What is an input tensor here?

The input tensor is the data you feed into the network:

  • x is the input (single sample or batch).
  • If the network is Network(dim_in, dim_out), then:
    • Input shape is [dim_in] (single sample) or [batch_size, dim_in] (batch).
    • Output shape is [dim_out] (single sample) or [batch_size, dim_out] (batch).

For net = Network(10, 4), input is length 10, output is length 4.
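These shape relationships can be sketched in plain Python with a hand-rolled linear map standing in for nn.Linear (the weight values here are made up for illustration):

```python
def linear(x, W, b):
    # same mapping as nn.Linear(dim_in, dim_out): y_j = sum_i W[j][i] * x_i + b[j]
    return [sum(wij * xi for wij, xi in zip(row, x)) + bj
            for row, bj in zip(W, b)]

dim_in, dim_out = 10, 4
W = [[0.1] * dim_in for _ in range(dim_out)]  # dim_out rows of dim_in weights
b = [0.0] * dim_out
x = [1.0] * dim_in                            # one sample of length dim_in

y = linear(x, W, b)
print(len(x), len(y))  # 10 4
```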

What does env do?

env = gym.make("ALE/SpaceInvaders-v5") creates the game environment (simulator). (Note the lowercase v: Gymnasium environment ids end in -v5.)

You use env to:

  • reset() to start an episode
  • step(action) to apply an action and get next state + reward
  • action_space to know valid actions
  • observation_space to inspect state format
  • close() to release resources

They are connected in a training loop:

  1. env gives current observation/state s
  2. Network processes s and outputs action scores/probabilities
  3. Agent chooses action a
  4. env.step(a) returns s', reward r, and done flag
  5. This experience is used to train the network
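The five steps above can be sketched with a toy stand-in environment. Everything here is hypothetical: a 2-state chain, a fixed 10-step episode, and a random "agent" in place of a real network:

```python
import random

class ToyEnv:
    """Minimal gym-style environment: 2 states, 2 actions (hypothetical dynamics)."""
    def reset(self):
        self.s = 0
        self.t = 0
        return self.s
    def step(self, a):
        reward = 1.0 if (self.s == 1 and a == 1) else 0.0  # reward for action 1 in state 1
        self.s = 1 - self.s          # deterministic transition to the other state
        self.t += 1
        done = self.t >= 10          # episode ends after 10 steps
        return self.s, reward, done

env = ToyEnv()
s = env.reset()                      # 1. env gives current state s
done = False
experience = []
while not done:
    a = random.randrange(2)          # 2-3. "agent" chooses an action (random stand-in for a network)
    s_next, r, done = env.step(a)    # 4. env returns s', reward r, done flag
    experience.append((s, a, r, s_next, done))  # 5. experience used to train
    s = s_next
```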

Are number of inputs = number of states, and outputs = number of actions?

Almost:

  • Inputs are the size of one state representation (features/pixels), not the number of possible states.
  • Outputs are usually number of actions (especially in DQN for discrete action spaces).

Is policy implemented in the neural network?

Yes.

  • In policy-gradient methods, NN directly outputs policy pi(a|s).
  • In DQN, NN outputs Q(s,a) values, and policy is derived from them (for example epsilon-greedy over argmax Q).
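A minimal epsilon-greedy sketch over a list of Q-values, in plain Python (the Q-values below are made up for illustration):

```python
import random

def epsilon_greedy(q_values, epsilon):
    # with probability epsilon explore uniformly; otherwise exploit argmax Q
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, 0.7, 0.3, 0.2]             # hypothetical Q(s, a) for 4 actions
print(epsilon_greedy(q, 0.0))        # 1  (pure exploitation: argmax)
```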

DQN vs policy-based methods

  • DQN:
    • Learns Q(s,a)
    • Indirect policy via max-Q action selection
    • Good for discrete actions
    • Uses replay buffer and target network
  • Policy-based:
    • Learns pi(a|s) directly
    • Works naturally for continuous actions too
    • Optimizes expected return directly
    • Can have higher gradient variance
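The contrast can be sketched numerically: a policy network turns scores into a probability distribution (softmax), while DQN-style selection just takes the argmax of Q-values (the scores below are made up):

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.0, 0.5, 0.5]             # hypothetical network outputs
pi = softmax(scores)                      # policy-based: a distribution over actions
greedy = max(range(len(scores)), key=lambda a: scores[a])  # DQN-style: argmax
print(round(sum(pi), 6), greedy)          # 1.0 1
```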

Full form of DQN

Deep Q-Network

Is Q(s,a) cumulative reward outcome for a state-action pair?

Yes. More precisely, it is expected discounted cumulative future reward after taking action a in state s, then following policy pi.

LaTeX formulas

Action-value function (definition) under policy $\pi$:
\(Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty}\gamma^{k} r_{t+k} \;\middle|\; s_t=s,\; a_t=a \right]\)

Optimal action-value function (definition):
\(Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)\)

Bellman optimality equation (for $Q^{*}$):
\(Q^{*}(s,a) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q^{*}(s_{t+1},a') \;\middle|\; s_t=s,\; a_t=a \right]\)
Why does this equation use $\max_{a'}$ instead of $\max_{\pi}$?

  • In recursive form, maximizing over next actions at each state is equivalent to maximizing over all policies.
  • A policy is a mapping from states to actions, so repeated local maximization over actions induces the globally optimal policy.
  • Therefore, $\max_{\pi}$ is implicit in the Bellman optimality equation through the recursive $\max_{a'}$.

Formula notation notes

  • All formulas in this document are written using LaTeX math notation.
  • \left ... \right creates auto-sized delimiters around an expression.
  • \middle adds an in-between delimiter (for example a conditional bar |) with matching size.
  • \; adds medium horizontal spacing in math mode for readability.
  • \gamma is the discount factor with $0 \le \gamma < 1$ (typically). It controls how much future rewards matter relative to immediate rewards.
  • Larger $\gamma$ makes the agent more far-sighted; smaller $\gamma$ makes it more short-sighted.
  • Clarification: higher $\gamma$ means less discounting, not more. Future rewards are weighted by powers of $\gamma$:
    \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\)
  • Example: with $\gamma=0.9$, future weights decay as $0.9, 0.81, 0.729, \ldots$; with $\gamma=0.5$, they decay as $0.5, 0.25, 0.125, \ldots$. So larger $\gamma$ keeps future terms larger.
  • $Q^{\pi}(s,a)$ is expected discounted return for state-action (s,a) when following a specific policy $\pi$.
  • $Q^{*}(s,a)$ is the optimal action-value: the maximum expected discounted return over all policies.
  • $Q^{*}$ is theoretical optimum for the given MDP; practical algorithms usually approximate it.
  • $a'$ (a-prime) is a placeholder for an action at the next state $s'$, used in $\max_{a'}$.
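The weighting above can be checked with a short helper that folds the sum $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$ from the back (the reward sequences are illustrative):

```python
def discounted_return(rewards, gamma):
    # fold from the back: G_t = r_t + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], 0.9))   # ~ 1 + 0.9 + 0.81 = 2.71
print(discounted_return([1, 1, 1], 0.5))   # ~ 1 + 0.5 + 0.25 = 1.75
```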

Bellman expectation over next states

Even though the Bellman optimality equation is written with $s_{t+1}$, it still accounts for all possible future states.

In
\(Q^{*}(s,a) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q^{*}(s_{t+1},a') \;\middle|\; s_t=s,\; a_t=a \right],\)
the expectation is over the transition distribution $P(s_{t+1}, r_t \mid s_t, a_t)$, so all possible next states and rewards are included.

Equivalent explicit form:
\(Q^{*}(s,a) = \sum_{s',r} P(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^{*}(s',a')\right].\)

The recursion inside $Q^{*}(s',a')$ then propagates this to all downstream future states.
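The explicit sum can be run directly on a tiny tabular MDP by iterating the right-hand side until it converges (value iteration). The transition table P below, including its probabilities and rewards, is entirely made up:

```python
# P[s][a] = list of (probability, next_state, reward) -- hypothetical 2-state MDP
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

Q = {s: {a: 0.0 for a in P[s]} for s in P}
for _ in range(200):
    # Q*(s,a) = sum over (s', r) of P(s', r | s, a) * [ r + gamma * max_a' Q*(s', a') ]
    Q = {s: {a: sum(p * (r + gamma * max(Q[s2].values()))
                    for p, s2, r in P[s][a])
             for a in P[s]}
         for s in P}

print(round(Q[1][1], 3))   # staying in state 1 forever: 2 / (1 - 0.9) = 20.0
```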

How recursion creates $\gamma^k$

The exponential discount pattern is a consequence of repeatedly applying the one-step Bellman recursion.

Start from the one-step form (policy version):
\(Q^{\pi}(s_t,a_t) = \mathbb{E}_{\pi}\!\left[r_t + \gamma Q^{\pi}(s_{t+1},a_{t+1}) \;\middle|\; s_t,a_t\right].\)

Apply the same equation to the next term:
\(Q^{\pi}(s_{t+1},a_{t+1}) = \mathbb{E}_{\pi}\!\left[r_{t+1} + \gamma Q^{\pi}(s_{t+2},a_{t+2}) \;\middle|\; s_{t+1},a_{t+1}\right].\)

Substitute back:
\(Q^{\pi}(s_t,a_t) = \mathbb{E}_{\pi}\!\left[r_t + \gamma r_{t+1} + \gamma^2 Q^{\pi}(s_{t+2},a_{t+2}) \;\middle|\; s_t,a_t\right].\)

Repeating this gives:
\(Q^{\pi}(s_t,a_t) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k} \;\middle|\; s_t,a_t\right].\)

So $\gamma$ is a constant discount factor, and the powers $\gamma^k$ appear because each recursion step multiplies by another $\gamma$.

Meaning of the conditioning bar

In
\(Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k} \;\middle|\; s_t=s,\; a_t=a\right],\)
the term $\mid s_t=s,\; a_t=a$ is a condition, not a sweep. It fixes the starting point to that specific state-action pair.

The averaging is done by the expectation $\mathbb{E}_{\pi}[\cdot]$, which integrates over possible trajectories induced by the environment dynamics and policy $\pi$.
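This expectation can be approximated by Monte Carlo: fix the starting pair (s, a), sample many trajectories under the dynamics and policy, and average the discounted returns. The 2-state chain and "always pick action 1" policy below are hypothetical:

```python
import random

gamma = 0.9

def rollout(s, a, steps=50):
    # one sampled trajectory starting from the fixed (s, a)
    g, discount = 0.0, 1.0
    for _ in range(steps):
        r = 1.0 if (s == 1 and a == 1) else 0.0   # made-up reward model
        g += discount * r
        discount *= gamma
        s = 1 if random.random() < 0.7 else 0     # stochastic transition
        a = 1                                     # fixed policy pi: always action 1
    return g

# E_pi[ sum gamma^k r_{t+k} | s_t = 1, a_t = 1 ], averaged over 2000 trajectories
estimate = sum(rollout(1, 1) for _ in range(2000)) / 2000
```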

For optimality, the sweep is represented by
\(\max_{\pi} Q^{\pi}(s,a),\)
or equivalently by recursive action maximization $\max_{a'}$ in the Bellman optimality equation.

Code implementation

loop through all episodes 
    loop through steps in each episode
        at each step choose an action, 
        calculate the loss,
        update the network

Embedded drl_intro_notes.py

"""
loop through all episodes 
    loop through steps in each episode
        at each step choose an action, 
        calculate the loss,
        update the network
"""

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("ALE/SpaceInvaders-v5")
class Network(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)
    def forward(self, x):
        return self.linear(x)

# instantiate the network. The NN works as a policy, mapping states to action scores.
# input is 8: the size of one state representation (8 features make up one state).
# output is 4: the number of actions.
# (These dimensions are illustrative; the actual SpaceInvaders observation is an image, not an 8-feature vector.)
network = Network(8, 4)

# instantiate optimizer 
optimizer = torch.optim.Adam(network.parameters(), lr=0.0001)

## the basic loop 
## iterate through episodes, each episode contains different states and different cumulative reward, defining a trajectory.
for episode in range(1000):
    state, info = env.reset()
    done = False
    ## build one trajectory by looping through all the states in it
    while not done:
        ## select an action using the neural network (select_action is a placeholder to be implemented)
        action = select_action(network, state)
        ## observe the next state and reward from the env
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ## calculate loss (calculate_loss is a placeholder). The agent creates its own training data from experience.
        loss = calculate_loss(network, state, action, next_state, reward, done)
        ## update the neural network based on the loss, reusing the optimizer instantiated
        ## once above (re-creating it every step would reset its internal state)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        state = next_state
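The skeleton calls select_action and calculate_loss without defining them. One possible DQN-style sketch of both, assuming an epsilon-greedy policy and a one-step TD target (the epsilon and gamma values are illustrative, and a real DQN would also use a replay buffer and a target network):

```python
import random
import torch
import torch.nn.functional as F

gamma = 0.99
epsilon = 0.1

def select_action(network, state):
    # epsilon-greedy over the network's Q-value outputs
    with torch.no_grad():
        q = network(torch.as_tensor(state, dtype=torch.float32))
    if random.random() < epsilon:
        return random.randrange(q.numel())   # explore: random action
    return int(q.argmax().item())            # exploit: argmax Q

def calculate_loss(network, state, action, next_state, reward, done):
    # one-step TD error: Q(s, a) vs r + gamma * max_a' Q(s', a')
    q_sa = network(torch.as_tensor(state, dtype=torch.float32))[action]
    with torch.no_grad():
        q_next = network(torch.as_tensor(next_state, dtype=torch.float32)).max()
    target = reward + gamma * q_next * (0.0 if done else 1.0)
    return F.mse_loss(q_sa, target)
```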