Reinforcement Learning Q&A Notes
What does line 12 in test.py do?
In Network.forward, line 12 returns self.linear(x). It applies a linear transformation to input tensor x and returns the raw output (logits).
What is an input tensor here?
Yes, your interpretation is correct:
- `x` is the input (single sample or batch).
- If the network is `Network(dim_in, dim_out)`, then:
  - Input shape is `[dim_in]` (single sample) or `[batch_size, dim_in]` (batch).
  - Output shape is `[dim_out]` (single sample) or `[batch_size, dim_out]` (batch).

For `net = Network(10, 4)`, input is length 10, output is length 4.
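These shapes can be verified directly. A minimal sketch, reproducing the two-layer-free `Network` class from test.py so the snippet is self-contained:

```python
import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        return self.linear(x)

net = Network(10, 4)
single_out = net(torch.zeros(10))     # one sample of 10 features
batch_out = net(torch.zeros(32, 10))  # batch of 32 samples
print(single_out.shape)  # torch.Size([4])
print(batch_out.shape)   # torch.Size([32, 4])
```

`nn.Linear` broadcasts over any leading batch dimensions, which is why both calls work without changes.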
What does env do?
`env = gym.make("ALE/SpaceInvaders-v5")` creates the game environment (simulator). Note the Gymnasium id uses a lowercase `v5`.
You use env to:
- `reset()` to start an episode
- `step(action)` to apply an action and get the next state + reward
- `action_space` to know valid actions
- `observation_space` to inspect the state format
- `close()` to release resources
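The same five-part API can be exercised without installing the Atari ROMs by using a toy stand-in. `ToyEnv` below is hypothetical (not a real Gymnasium environment), but it mimics the `reset`/`step`/`close` contract, including the five-tuple that `step` returns:

```python
import random

class ToyEnv:
    """Minimal stand-in that mimics the Gymnasium env API (hypothetical)."""
    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.n_actions = 4

    def reset(self):
        self.t = 0
        return [0.0], {}  # observation, info

    def step(self, action):
        self.t += 1
        obs = [float(self.t)]
        reward = 1.0
        terminated = self.t >= self.max_steps
        truncated = False
        return obs, reward, terminated, truncated, {}

    def close(self):
        pass

env = ToyEnv()
obs, info = env.reset()
total, done = 0.0, False
while not done:
    obs, r, terminated, truncated, _ = env.step(random.randrange(env.n_actions))
    total += r
    done = terminated or truncated
env.close()
print(total)  # 10.0
```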
How is the neural network related to env?
They are connected in a training loop:
- `env` gives the current observation/state `s`
- Network processes `s` and outputs action scores/probabilities
- Agent chooses action `a`
- `env.step(a)` returns next state `s'`, reward `r`, and a done flag
- This experience is used to train the network
Are number of inputs = number of states, and outputs = number of actions?
Almost:
- Inputs are the size of one state representation (features/pixels), not the number of possible states.
- Outputs are usually number of actions (especially in DQN for discrete action spaces).
Is policy implemented in the neural network?
Yes.
- In policy-gradient methods, the NN directly outputs the policy `pi(a|s)`.
- In DQN, the NN outputs `Q(s,a)` values, and the policy is derived from them (for example epsilon-greedy over `argmax Q`).
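A minimal sketch of deriving an epsilon-greedy action from Q-values (plain Python; the function name and the example values are illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Mostly pick argmax of Q(s, a); with probability epsilon, pick at random."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.1, 0.5, 0.2, 0.4]
print(epsilon_greedy(q, epsilon=0.0))  # 1 (the argmax, since epsilon is 0)
```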
DQN vs policy-based methods
- DQN:
  - Learns `Q(s,a)`
  - Indirect policy via max-Q action selection
  - Good for discrete actions
  - Uses replay buffer and target network
- Policy-based:
  - Learns `pi(a|s)` directly
  - Works naturally for continuous actions too
  - Optimizes expected return directly
  - Can have higher gradient variance
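The replay buffer mentioned for DQN can be as simple as a bounded deque from which random minibatches are sampled. A sketch (capacity and batch size are illustrative):

```python
import random
from collections import deque

buffer = deque(maxlen=10_000)  # oldest transitions are dropped automatically

def store(transition):
    buffer.append(transition)  # transition = (s, a, r, s_next, done)

def sample(batch_size):
    return random.sample(buffer, batch_size)  # uniform random minibatch

for i in range(100):
    store((i, 0, 1.0, i + 1, False))
print(len(sample(32)))  # 32
```

Sampling uniformly from past experience breaks the temporal correlation between consecutive transitions, which stabilizes training.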
Full form of DQN
Deep Q-Network
Is Q(s,a) cumulative reward outcome for a state-action pair?
Yes. More precisely, it is the expected discounted cumulative future reward after taking action $a$ in state $s$ and then following policy $\pi$.
LaTeX formulas
Action-value function (definition) under policy $\pi$:
\(Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty}\gamma^{k} r_{t+k} \;\middle|\; s_t=s,\; a_t=a \right]\)
Optimal action-value function (definition):
\(Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)\)
Bellman optimality equation (for $Q^{*}$):
\(Q^{*}(s,a) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q^{*}(s_{t+1},a') \;\middle|\; s_t=s,\; a_t=a \right]\)
Why does this equation use $\max_{a'}$ instead of $\max_{\pi}$?
- In recursive form, maximizing over next actions at each state is equivalent to maximizing over all policies.
- A policy is a mapping from states to actions, so repeated local maximization over actions induces the globally optimal policy.
- Therefore, $\max_{\pi}$ is implicit in the Bellman optimality equation through recursive $\max_{a'}$.
Formula notation notes
- All formulas in this document are written using LaTeX math notation.
- `\left ... \right` creates auto-sized delimiters around an expression.
- `\middle` adds an in-between delimiter (for example a conditional bar `|`) with matching size.
- `\;` adds a thick horizontal space in math mode for readability.
- `\gamma` is the discount factor with $0 \le \gamma < 1$ (typically). It controls how much future rewards matter relative to immediate rewards.
- Larger $\gamma$ makes the agent more far-sighted; smaller $\gamma$ makes it more short-sighted.
- Clarification: higher $\gamma$ means less discounting, not more. Future rewards are weighted by powers of $\gamma$:
  \(r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots\)
- Example: with $\gamma=0.9$, future weights decay as $0.9, 0.81, 0.729, \ldots$; with $\gamma=0.5$, they decay as $0.5, 0.25, 0.125, \ldots$. So larger $\gamma$ keeps future terms larger.
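The decay pattern can be checked in two lines of Python:

```python
# Weights on rewards 1, 2, 3 steps ahead for two discount factors.
print([round(0.9 ** k, 3) for k in range(1, 4)])  # [0.9, 0.81, 0.729]
print([round(0.5 ** k, 3) for k in range(1, 4)])  # [0.5, 0.25, 0.125]
```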
- $Q^{\pi}(s,a)$ is the expected discounted return for state-action pair $(s,a)$ when following a specific policy $\pi$.
- $Q^{*}(s,a)$ is the optimal action-value: the maximum expected discounted return over all policies.
- $Q^{*}$ is theoretical optimum for the given MDP; practical algorithms usually approximate it.
- $a'$ (a-prime) is a placeholder for an action at the next state $s'$, used in $\max_{a'}$.
Bellman expectation over next states
Even though the Bellman optimality equation is written with $s_{t+1}$, it still accounts for all possible future states.
In
\(Q^{*}(s,a) = \mathbb{E}\!\left[ r_t + \gamma \max_{a'} Q^{*}(s_{t+1},a') \;\middle|\; s_t=s,\; a_t=a \right],\)
the expectation is over the transition distribution $P(s_{t+1}, r_t \mid s_t, a_t)$, so all possible next states and rewards are included.
Equivalent explicit form:
\(Q^{*}(s,a) = \sum_{s',r} P(s', r \mid s, a)\left[r + \gamma \max_{a'} Q^{*}(s',a')\right].\)
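The explicit sum can be turned into a tiny value-iteration check. The MDP below is hypothetical: two states, two actions, deterministic transitions, so the sum over $(s', r)$ collapses to a single term per $(s,a)$:

```python
# (s, a) -> (s', r): deterministic two-state, two-action MDP (illustrative).
P = {
    (0, 0): (0, 0.0),
    (0, 1): (1, 1.0),
    (1, 0): (0, 0.0),
    (1, 1): (1, 2.0),
}
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality backup until Q converges.
Q = {sa: 0.0 for sa in P}
for _ in range(500):
    Q = {(s, a): r + gamma * max(Q[(s2, 0)], Q[(s2, 1)])
         for (s, a), (s2, r) in P.items()}

# Staying in state 1 earns reward 2 forever: Q*(1,1) = 2 / (1 - 0.9) = 20.
print(round(Q[(1, 1)], 2))  # 20.0
print(round(Q[(0, 1)], 2))  # 19.0  (= 1 + 0.9 * 20)
```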
The recursion inside $Q^{*}(s',a')$ then propagates this to all downstream future states.
How recursion creates $\gamma^k$
The exponential discount pattern is a consequence of repeatedly applying the one-step Bellman recursion.
Start from the one-step form (policy version):
\(Q^{\pi}(s_t,a_t) = \mathbb{E}_{\pi}\!\left[r_t + \gamma Q^{\pi}(s_{t+1},a_{t+1}) \;\middle|\; s_t,a_t\right].\)
Apply the same equation to the next term:
\(Q^{\pi}(s_{t+1},a_{t+1}) = \mathbb{E}_{\pi}\!\left[r_{t+1} + \gamma Q^{\pi}(s_{t+2},a_{t+2}) \;\middle|\; s_{t+1},a_{t+1}\right].\)
Substitute back:
\(Q^{\pi}(s_t,a_t)
= \mathbb{E}_{\pi}\!\left[r_t + \gamma r_{t+1} + \gamma^2 Q^{\pi}(s_{t+2},a_{t+2}) \;\middle|\; s_t,a_t\right].\)
Repeating this gives:
\(Q^{\pi}(s_t,a_t)
= \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k} \;\middle|\; s_t,a_t\right].\)
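With deterministic rewards the expectation drops out, and the equivalence between the one-step recursion and the discounted sum can be checked numerically (reward values are illustrative):

```python
rewards = [1.0, 2.0, 3.0, 4.0]
gamma = 0.9

# Direct discounted sum: sum_k gamma^k * r_{t+k}
direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# Backward one-step recursion: Q_t = r_t + gamma * Q_{t+1};
# each backup multiplies the future by one more factor of gamma.
q = 0.0
for r in reversed(rewards):
    q = r + gamma * q

print(round(direct, 3), round(q, 3))  # 8.146 8.146
```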
So $\gamma$ is a constant discount factor, and the powers $\gamma^k$ appear because each recursion step multiplies by another $\gamma$.
Meaning of the conditioning bar
In
\(Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k} \;\middle|\; s_t=s,\; a_t=a\right],\)
the term $\mid s_t=s,\; a_t=a$ is a condition, not a sweep. It fixes the starting point to that specific state-action pair.
The averaging is done by the expectation $\mathbb{E}_{\pi}[\cdot]$, which integrates over possible trajectories induced by the environment dynamics and policy $\pi$.
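This can be illustrated numerically: fix the start (the condition), sample many trajectories, and average (the expectation). The toy below uses hypothetical i.i.d. coin-flip rewards so the true value has a closed form:

```python
import random

random.seed(0)
gamma = 0.9
horizon = 50

def sampled_return():
    """Discounted return of one trajectory; each reward is 0 or 1 with p = 0.5."""
    return sum(gamma ** k * random.choice([0.0, 1.0]) for k in range(horizon))

n = 20_000
estimate = sum(sampled_return() for _ in range(n)) / n

# E[r] = 0.5 each step, so the true value is a truncated geometric series.
true_value = 0.5 * (1 - gamma ** horizon) / (1 - gamma)
print(round(estimate, 2), round(true_value, 2))
```

The Monte Carlo average converges to the conditional expectation as the number of sampled trajectories grows.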
For optimality, the sweep is represented by
\(\max_{\pi} Q^{\pi}(s,a),\)
or equivalently by recursive action maximization $\max_{a'}$ in the Bellman optimality equation.
Code implementation
loop through all episodes
    loop through steps in each episode
        at each step choose an action,
        calculate the loss,
        update the network
Embedded drl_intro_notes.py
"""
loop through all episodes
loop through steps in each episode
at each step choose an action,
calculate the loss,
update the network
"""
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
env = gym.make("ALE/SpaceInvaders-v5")

class Network(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        return self.linear(x)

# instantiate the network. the nn works as a policy to map states to actions.
# input is 8: the size of one state representation (8 features constitute one state).
# note: "ALE/SpaceInvaders-v5" actually returns pixel observations, so dim_in=8
# fits a feature-based environment; it is kept here to match the notes above.
# output is the number of actions, 4 in this case.
network = Network(8, 4)

# instantiate the optimizer once, outside the training loop
optimizer = torch.optim.Adam(network.parameters(), lr=0.0001)

## the basic loop
## iterate through episodes; each episode visits different states and collects a
## different cumulative reward, defining a trajectory.
for episode in range(1000):
    state, info = env.reset()
    done = False
    ## build one trajectory by looping through all the states in it.
    while not done:
        ## select an action using the neural network.
        action = select_action(network, state)
        ## observe the next state and reward from the env.
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        ## calculate the loss. the agent creates its own training data
        ## from its experience.
        loss = calculate_loss(network, state, action, next_state, reward, done)
        ## update the neural network based on the loss.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        state = next_state
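`select_action` and `calculate_loss` are used in the loop but never defined. A minimal DQN-style sketch of both (hypothetical helpers, assuming the `Network` class above with its `linear` attribute and a discrete action space; no target network or replay buffer):

```python
import random
import torch
import torch.nn as nn

def select_action(network, state, epsilon=0.1):
    """Epsilon-greedy over the network's Q-value outputs (hypothetical helper)."""
    if random.random() < epsilon:
        return random.randrange(network.linear.out_features)
    with torch.no_grad():
        q_values = network(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())

def calculate_loss(network, state, action, next_state, reward, done, gamma=0.99):
    """One-step TD loss on Q(s, a) (hypothetical helper; no target network)."""
    q = network(torch.as_tensor(state, dtype=torch.float32))[action]
    with torch.no_grad():
        max_next_q = network(torch.as_tensor(next_state, dtype=torch.float32)).max()
        # terminal states contribute no future value
        target = reward + gamma * max_next_q * (1.0 - float(done))
    return nn.functional.mse_loss(q, target)
```

A production DQN would add a replay buffer, a periodically synced target network, and epsilon annealing; this sketch only shows where the two calls in the loop plug in.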